Abstract

Histogram-valued variables are a particular kind of variables studied in Symbolic Data Analysis 
where to each entity under analysis corresponds a distribution that may be represented by a histogram 
or by a quantile function. Linear regression models for this type of data are necessarily more complex
than a simple generalization of the classical model: the parameters cannot be negative, yet the
linear relationship between the variables must be allowed to be either direct or inverse. In this work we
propose a new linear regression model for histogram-valued variables that solves this problem, named
Distribution and Symmetric Distribution Regression Model. To determine the parameters of this model
it is necessary to solve a quadratic optimization problem, subject to non-negativity constraints on the
unknowns; the error measure between the predicted and observed distributions uses the Mallows distance.
As in classical analysis, the model is associated with a goodness-of-fit measure whose values
range between 0 and 1. Using the proposed model, applications with real and simulated data are
presented.

Keywords: data with variability; linear regression; Symbolic Data Analysis; quantile functions; 
Mallows distance. 

1 Introduction 

Classical multivariate statistics studies data tables that summarize observations made on "statistical units" 
(individuals); each row of the table represents one individual and each of these individuals is characterized 
by different variables (in columns). The "values" attained by the variables may be real values if the 
variable represents the measurement of a quantity (quantitative variables) or a category if the variable is 
qualitative. As an example, let us have classical quantitative variables such as the age, weight and height 
of a particular football player. The observations of these data are typically represented in classical data 
tables, but how can we represent the result of the weight of the football player if we don't know his exact 
weight? And what if we are interested in studying the age, weight and height not of one single player 
but of a football team? In the first situation, the individuals are described by attributes whose associated 
values are quantitative values that cannot be "measured" with precision. In cases like this, we are in the 
presence of imprecise data. In the second situation we are interested in describing one class of individuals.
The "best values" attained by the variables that characterize each class are not real values or categories
but sets of "values", intervals or distributions. Even though data with variability or uncertainty may be 
represented by the same type of elements, the meaning of these elements is different. For example, the 
interval [80, 82] may mean that the weight of one football player is between 80 and 82 Kg. On the other 
hand, the interval [75, 80] may represent the weights of all players from a given football team. In the first 
situation the interval represents the imprecision of the weight value, whereas in the second situation the 
interval considers the variability of weight values in the football team. 

In this research we will focus on situations where variability in data description occurs. The classical 
solution to analyze these data is to reduce the collection of records associated to each individual or class
of individuals to one value, which may be the mean, mode or maximum/minimum; however, with this option
the variability across the records is lost. As an alternative to the classical analysis applied to this kind of
data, Diday [11] introduced Symbolic Data Analysis, where the term symbolic data refers precisely to data
with variability. To understand the concept of symbolic data it is important to assess where variability 
comes from. The variability of the data might emerge due to the aggregation of observations [ ] that 
can be contemporary, if the records are collected in the same temporal instant or the temporal instant 
is not relevant, and temporal if the time is the aggregation criterion and the records are grouped along 
one unit of time, for example one day. In both situations, the initial data or micro-data, are organized 
in classical data tables where each individual, termed first-level unit, is described by classical variables. 
Depending on the type of aggregation, the construction of the symbolic data table is different. When the 
aggregation is temporal, the entities under analysis are the original first-level units, now characterized 
by sets of values originating from the records collected over a unit of time. In situations where the 
aggregation is contemporary, the entities - higher-level units - are classes of individuals (sets of first-level 
units) grouped according to specific characteristics. In this situation, the variables describing both the 
higher-level and the respective first-level units are the same; however the "values" that the variables take 
for each higher-level unit are now sets of values or functions obtained from the respective first-level units. 

Following the definition of Bock and Diday [ ], a symbolic variable $Y$ is a mapping $Y : E \to B$ defined
on a set $E$ of statistical entities ($E = \Omega = \{1, 2, \ldots, m\}$ when the individuals are first-level units or
$E = \{C_1, C_2, \ldots\}$ with $C_j \subseteq \Omega$ when the individuals are higher-level units) and which takes its values
in a set $B$. Henceforth in this work, when we use the term unit, we will be referring to a first-level unit
or to a higher-level unit, according to the kind of prior aggregation of the micro-data used to build the 
symbolic data table. 

Similarly to the classical case, symbolic variables can also be classified as quantitative or qualitative, 
according to the nature of the elements of B. For quantitative symbolic variables, each unit is allowed 
to take a single value (single-valued variables); a finite set of values (multi-valued variables); an interval 
(interval-valued variables); or a mapping that can be a probability/frequency/weight distribution (modal- 
valued variables). 

In this paper, we will be dealing with a particular type of modal-valued variables, the histogram-valued 
variables. In this case, the values attained by the variable for each unit are empirical frequency distributions
or, more specifically, histograms, where the values in each subinterval are assumed to be uniformly
distributed. If we consider a symbolic variable where all units are associated with a single interval of
real numbers (uniformly distributed) with probability/frequency/weight equal to one, then we are in the
presence of interval-valued variables.

As an example, consider a symbolic data table containing information about patients (adults) attending 
healthcare centers, during a fixed period of time. In healthcare centre A, the age of patients ranged from 
25 to 53 years old, in healthcare centre B, it ranged from 33 to 68 years old and in healthcare centre C, the 
age of patients ranged from 20 to 75 years old, so that the age is an interval-valued variable. Now consider
another variable which records the waiting time for consultations. In this case, information is recorded for 
5 time lengths (0 to 15 minutes, 15 to 30 minutes,...), and the corresponding symbolic variable is therefore 
a histogram-valued variable (see Table 1). Notice that in this example the entities under analysis are the 
healthcare centers (higher-level units), for each of which we have aggregated information (contemporary 
aggregation), and NOT the individual patients attending each centre (first-level units). 



Healthcare centers   Age        Waiting Time (minutes)
A                    [25, 53]   {[0, 15[, 0; [15, 30[, 0.25; [30, 45[, 0.5; [45, 60[, 0; > 60, 0.25}
B                    [33, 68]   {[0, 15[, 0.25; [15, 30[, 0.25; [30, 45[, 0.25; [45, 60[, 0.25; > 60, 0}
C                    [20, 75]   {[0, 15[, 0.33; [15, 30[, 0; [30, 45[, 0.33; [45, 60[, 0; > 60, 0.33}

Table 1: Data for three healthcare centers.



Symbolic Data Analysis has developed considerably since the 1980s (see, for
instance, [3], [4], [7], [12], [19]). Recently, there has been a growing interest in the analysis of
histogram-valued variables, though more research is still devoted to interval-valued variables. The methods
proposed so far for the former are, frequently, generalizations of their counterparts for the latter.
The main definitions of descriptive statistics for one, two or more histogram-valued variables have already
been studied. Billard and Diday [4] defined mean, observed and relative frequency, empirical density
function, empirical joint density function; for variance and covariance two definitions were proposed [3],
[4], [5]; Irpino and Verde [14] defined distribution functions and joint distribution functions.

The first definitions and methods for histogram-valued variables are generally obtained from the application
of the classic concepts to the midpoints of the histograms' subintervals, using the respective weights.
Furthermore, although the symbolic variables' values are distributions and not real numbers, the results
of the application of these concepts are real numbers. For example, the mean of m observations of the
histogram-valued variable, proposed by Billard and Diday [4], is a real number. It should be noticed,
however, that in recent years other works have been put forward where the "results" are already distributions.
For example, Irpino and Verde [ ] present an alternative definition of mean for histogram-valued
variables, which produces a mean distribution that they termed the barycentric histogram.

Work with histogram-valued variables has been recently reported in different domains, such as Principal 
Component Analysis [21], [22]; Cluster Analysis [ ]; Time series [ ] and Linear Regression [ ], [ ]. 

The first linear regression model for histogram-valued variables was a generalization of the first model
proposed for interval-valued variables by Billard and Diday [3], [6]. Other models have also been proposed
for interval-valued variables [17], [18]; however, these models present some limitations: firstly,
they are based on differences between real values and do not appropriately quantify the closeness between
intervals; secondly, the elements predicted by the models may fail to build an interval; finally, the most recent
model imposes non-negativity constraints on the coefficients, therefore forcing a direct linear relationship.
These limitations prevent a generalization of the models to histogram-valued variables, so that alternative
models are being developed (see, e.g., [13], [23]). Our goal is to propose a linear regression model for
histogram-valued variables that allows distributions to be predicted from other distributions, without forcing a
direct linear relationship.

The development of non-descriptive methods for Symbolic Data Analysis is still an open research topic
for almost all kinds of symbolic variables. Notice, however, the recently published papers proposing
probabilistic models for interval-valued variables [8], [16].

The remainder of the paper is organized as follows. Section 2 introduces histogram-valued variables
in more detail, and presents a short study about the space of the quantile functions. In Section 3, the 
problem of defining a linear regression model for histogram-valued variables is addressed. A model and 
a respective goodness-of-fit measure are proposed. Section 4 reports results of a simulation study and two 
examples that illustrate the application of the model. Finally, Section 5 concludes the paper, pointing out 
directions for future research. 



2 Symbolic Data Analysis: histogram data 
2.1 Histogram-valued variables

Consider a symbolic variable $Y : E \to B$. The set of units $E$ may be $E = \Omega = \{1, 2, \ldots, m\}$ when the
individuals are first-level units or $E = \{C_1, C_2, \ldots\}$ with $C_j \subseteq \Omega$ when the individuals are higher-level
units. Consider also the quantitative (single-valued) variable $Y$ defined on a set $\Omega$. If the aggregation of
the observations is temporal, to each unit $j \in \Omega$ corresponds the empirical distribution of the values that
$Y$ takes within a certain unit of time. If the aggregation is contemporary, to each unit $j$ corresponds the
empirical distribution of $Y$ in $C_j$. As histograms are a usual representation of empirical distributions,
symbolic variables of this kind are termed histogram-valued variables. More generally we can define
histogram-valued variables as follows: 



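Definition 2.1 $Y$ is a histogram-valued variable when to each unit $j$ corresponds an empirical distribution
$Y(j)$ that can be represented by a histogram [7], [4]:

\[
H_{Y(j)} = \left\{ \left[\underline{I}_{Y(j)_1}, \overline{I}_{Y(j)_1}\right[, p_{j1};\ \left[\underline{I}_{Y(j)_2}, \overline{I}_{Y(j)_2}\right[, p_{j2};\ \ldots;\ \left[\underline{I}_{Y(j)_{n_j}}, \overline{I}_{Y(j)_{n_j}}\right], p_{jn_j} \right\}
\tag{1}
\]

where $\underline{I}_{Y(j)_i}$ and $\overline{I}_{Y(j)_i}$ represent the lower and upper bounds of subinterval $i$; $p_{ji}$ is the frequency
associated to the subinterval $\left[\underline{I}_{Y(j)_i}, \overline{I}_{Y(j)_i}\right[$ with $i \in \{1, 2, \ldots, n_j\}$; $n_j$ is the number of subintervals
for the $j$-th unit, $j = 1, \ldots, m$; $\sum_{i=1}^{n_j} p_{ji} = 1$; $\underline{I}_{Y(j)_i} \le \overline{I}_{Y(j)_i}$ and $\overline{I}_{Y(j)_i} \le \underline{I}_{Y(j)_{i+1}}$.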

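Alternatively, $Y(j)$ can be represented by the inverse of the cumulative empirical distribution function,
also called quantile function, $\Psi^{-1}_{Y(j)}$:

\[
\Psi^{-1}_{Y(j)}(t) =
\begin{cases}
\underline{I}_{Y(j)_1} + \dfrac{t}{w_{j1}}\, a_{Y(j)_1} & \text{if } 0 \le t < w_{j1} \\[2mm]
\underline{I}_{Y(j)_2} + \dfrac{t - w_{j1}}{w_{j2} - w_{j1}}\, a_{Y(j)_2} & \text{if } w_{j1} \le t < w_{j2} \\
\quad \vdots \\
\underline{I}_{Y(j)_{n_j}} + \dfrac{t - w_{j,n_j-1}}{1 - w_{j,n_j-1}}\, a_{Y(j)_{n_j}} & \text{if } w_{j,n_j-1} \le t \le 1
\end{cases}
\tag{2}
\]

where $w_{jl} = \begin{cases} 0 & \text{if } l = 0 \\ \sum_{h=1}^{l} p_{jh} & \text{if } l = 1, \ldots, n_j \end{cases}$ and $a_{Y(j)_i} = \overline{I}_{Y(j)_i} - \underline{I}_{Y(j)_i}$ with $i \in \{1, \ldots, n_j\}$;

$n_j$ is the number of subintervals in $Y(j)$.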

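Or, considering the subintervals of the histograms defined by the centers $c_{Y(j)_i}$ and half-ranges $r_{Y(j)_i}$,
the representation of $Y(j)$ can be given by

\[
H_{Y(j)} = \left\{ \left[c_{Y(j)_1} - r_{Y(j)_1}, c_{Y(j)_1} + r_{Y(j)_1}\right[, p_{j1};\ \ldots;\ \left[c_{Y(j)_{n_j}} - r_{Y(j)_{n_j}}, c_{Y(j)_{n_j}} + r_{Y(j)_{n_j}}\right], p_{jn_j} \right\}
\tag{3}
\]

or

\[
\Psi^{-1}_{Y(j)}(t) =
\begin{cases}
c_{Y(j)_1} + \left(\dfrac{2t}{w_{j1}} - 1\right) r_{Y(j)_1} & \text{if } 0 \le t < w_{j1} \\[2mm]
c_{Y(j)_2} + \left(\dfrac{2(t - w_{j1})}{w_{j2} - w_{j1}} - 1\right) r_{Y(j)_2} & \text{if } w_{j1} \le t < w_{j2} \\
\quad \vdots \\
c_{Y(j)_{n_j}} + \left(\dfrac{2(t - w_{j,n_j-1})}{1 - w_{j,n_j-1}} - 1\right) r_{Y(j)_{n_j}} & \text{if } w_{j,n_j-1} \le t \le 1
\end{cases}
\tag{4}
\]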

Any of these representations of the empirical distribution that each unit takes can be termed a histogram
value. Henceforth, when we use the term distribution, we are referring to an empirical distribution of a
continuous variable. Furthermore, it is also assumed that within each subinterval $\left[\underline{I}_{Y(j)_i}, \overline{I}_{Y(j)_i}\right[$ the
values of the variable $Y$ for each unit $j = 1, \ldots, m$ are uniformly distributed.

If any of the weights $p_{ji}$ with $i > 1$ is null, the function $\Psi_{Y(j)}$ does not have an inverse with domain between
0 and 1. Consequently the function $\Psi^{-1}_{Y(j)}$ is not continuous and has $n_j - 1$ pieces. In this case it is not
possible to calculate the value of $\Psi^{-1}_{Y(j)}(w_{j,i-1})$ but only the one-sided limits of $\Psi^{-1}_{Y(j)}(t)$ as $t \to w_{j,i-1}$.



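When $n_j = 1$ and, for each unit $j$, $Y(j)$ takes values only on the interval $\left[\underline{I}_{Y(j)}, \overline{I}_{Y(j)}\right]$ with frequency
$p_j = 1$, the histogram-valued variable is then reduced to the particular case of an interval-valued variable.
In this case, the quantile function is given by

\[
\Psi^{-1}_{Y(j)}(t) = \underline{I}_{Y(j)} + \left(\overline{I}_{Y(j)} - \underline{I}_{Y(j)}\right) t, \quad \text{with } 0 \le t \le 1.
\tag{5}
\]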

When we work with histogram-valued variables, it is important to note that for different observations, the 
number of subintervals in the histograms or the pieces in functions may be different; the subintervals of 
histograms $H_{Y(j)}$ are considered ordered and disjoint, and if this is not the case, it must be possible to
rewrite them in the required form [26], [2]. 

Example 2.1 Consider the histograms

$H_X = \{[1, 3[, 0.1;\ [3, 5[, 0.6;\ [5, 8], 0.3\}$

and

$H_Y = \{[0, 1[, 0.8;\ [1, 4], 0.2\}$

that characterize a unit for the histogram-valued variables $X$ and $Y$, respectively. These histograms are
represented in Figure 1:

Figure 1: Representation of the histograms $H_X$ and $H_Y$ in Example 2.1.

Alternatively, these histograms can be represented by their quantile functions (see Figure 2):

\[
\Psi^{-1}_X(t) =
\begin{cases}
1 + \dfrac{t}{0.1} \times 2 & \text{if } 0 \le t < 0.1 \\[2mm]
3 + \dfrac{t - 0.1}{0.6} \times 2 & \text{if } 0.1 \le t < 0.7 \\[2mm]
5 + \dfrac{t - 0.7}{0.3} \times 3 & \text{if } 0.7 \le t \le 1
\end{cases}
\qquad
\Psi^{-1}_Y(t) =
\begin{cases}
\dfrac{t}{0.8} & \text{if } 0 \le t < 0.8 \\[2mm]
1 + \dfrac{t - 0.8}{0.2} \times 3 & \text{if } 0.8 \le t \le 1
\end{cases}
\]

Figure 2: Representation of the quantile functions $\Psi^{-1}_X$ and $\Psi^{-1}_Y$ in Example 2.1.

It is important to bear in mind that in a histogram the lower bound of each subinterval is always less than
or equal to the upper bound, $\underline{I}_{Y(j)_i} \le \overline{I}_{Y(j)_i}$, and the lower bound of the following subinterval is always
greater than or equal to the previous upper bound, $\overline{I}_{Y(j)_i} \le \underline{I}_{Y(j)_{i+1}}$. Consequently, the quantile function that represents
the empirical distribution is always a non-decreasing function in the domain [0, 1].
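The construction of equation (2) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `quantile_function` and the representation of a histogram as a list of `(lower, upper, weight)` triples are assumptions made here, not notation from the paper.

```python
from bisect import bisect_right


def quantile_function(histogram):
    """Build the quantile function of a histogram.

    `histogram` is a list of (lower, upper, weight) triples describing
    ordered, disjoint subintervals whose weights sum to 1; values are
    assumed uniformly distributed inside each subinterval.
    """
    # Cumulative weights w_0 = 0, w_1, ..., w_n.
    cum = [0.0]
    for _, _, weight in histogram:
        cum.append(cum[-1] + weight)

    def psi_inv(t):
        if not 0.0 <= t <= 1.0:
            raise ValueError("t must lie in [0, 1]")
        # Piece i covers w_{i-1} <= t < w_i; t = 1 falls in the last piece.
        i = min(bisect_right(cum, t) - 1, len(histogram) - 1)
        # Step back over zero-weight pieces at the right end.
        while histogram[i][2] == 0:
            i -= 1
        lower, upper, weight = histogram[i]
        return lower + (t - cum[i]) / weight * (upper - lower)

    return psi_inv


# Histograms of Example 2.1.
H_X = [(1, 3, 0.1), (3, 5, 0.6), (5, 8, 0.3)]
H_Y = [(0, 1, 0.8), (1, 4, 0.2)]
psi_x = quantile_function(H_X)
psi_y = quantile_function(H_Y)
```

Evaluating `psi_x` and `psi_y` on an increasing grid of values in [0, 1] gives non-decreasing sequences, in line with the remark above.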

Many concepts and methods for histogram-valued variables have been defined using the representation 
of their realizations in the form of histograms [3], [4]. Only in more recent studies have these variables' 
values been represented as quantile functions [1], [14], [ ], [25]. When the distributions are represented
as histograms, the choice of the arithmetic becomes crucial. The complexity of the arithmetics [ ], [26]
that have been proposed so far for histograms was arguably the reason why the distributions started being
represented as quantile functions. If we represent the distribution that each unit takes on a histogram- 
valued variable by a quantile function, then operations are simplified because, as quantile functions are 
piecewise functions, the adequate arithmetic for them is a function arithmetic. In this work the option is 
to represent the distributions by quantile functions. However, this representation raises other questions. 

To operate with quantile functions, it is necessary to define all functions involved with an equal number
of pieces or, equivalently, to rewrite all corresponding histograms with the same number of subintervals.
For this, it may be necessary to apply the procedure defined by Irpino and Verde [14]. In addition, it
is important to prevent the number of subintervals of each histogram from becoming "too" large (which
could happen by applying the process proposed by Irpino and Verde [14]), in which case the distributions
that represent the data would be meaningless. To prevent this situation, we may consider the suggestion
of Colombo [ ], who encountered similar problems and found it advantageous to work with
equiprobable histograms (histograms of equal probability subintervals).
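One way to give two histograms the same number of pieces is to split their subintervals at the union of their cumulative weights, using the uniformity assumption inside each subinterval. The sketch below illustrates that idea only; it is not claimed to reproduce the exact procedure of [14], and the function names and the `(lower, upper, weight)` representation are assumptions introduced here.

```python
def cumulative_weights(histogram):
    """Return [w_0, ..., w_n] for a list of (lower, upper, weight) triples."""
    cum = [0.0]
    for _, _, weight in histogram:
        cum.append(cum[-1] + weight)
    cum[-1] = 1.0  # guard against rounding in the last cumulative weight
    return cum


def rewrite_on_common_weights(hist_a, hist_b, ndigits=12):
    """Rewrite two histograms so that they share the same cumulative weights."""
    # Union of the cumulative weights of both histograms (rounded to merge
    # values that differ only by floating-point noise).
    cuts = sorted({round(w, ndigits)
                   for h in (hist_a, hist_b)
                   for w in cumulative_weights(h)})

    def split(histogram):
        cum = cumulative_weights(histogram)
        pieces = []
        i = 0
        for w_lo, w_hi in zip(cuts, cuts[1:]):
            # Advance to the original subinterval that contains [w_lo, w_hi];
            # zero-weight subintervals are skipped automatically.
            while round(cum[i + 1], ndigits) <= w_lo:
                i += 1
            lower, upper, weight = histogram[i]
            width = upper - lower
            new_lower = lower + (w_lo - cum[i]) / weight * width
            new_upper = lower + (w_hi - cum[i]) / weight * width
            pieces.append((new_lower, new_upper, w_hi - w_lo))
        return pieces

    return split(hist_a), split(hist_b)


# Histograms of Example 2.1: three and two subintervals, respectively.
H_X = [(1, 3, 0.1), (3, 5, 0.6), (5, 8, 0.3)]
H_Y = [(0, 1, 0.8), (1, 4, 0.2)]
H_X_common, H_Y_common = rewrite_on_common_weights(H_X, H_Y)
```

For the histograms of Example 2.1 both results have four subintervals with cumulative weights 0.1, 0.7, 0.8 and 1. The number of pieces can grow quickly when many histograms are combined, which is the concern raised above.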

2.2 The space of quantile functions 

Quantile functions are a particular kind of function. If we consider the set of functions from
$\mathbb{R}$ to $\mathbb{R}$, $\mathcal{F}(\mathbb{R}, \mathbb{R})$, and the usual operations defined on it: addition, $(f + g)(x) = f(x) + g(x), \forall x \in \mathbb{R}$, and
product of a function by a real number, $(\lambda f)(x) = \lambda f(x), \forall x \in \mathbb{R}$ and $\lambda \in \mathbb{R}$, it follows that $(\mathcal{F}, +, \cdot)$ is a
vector space. However, if we consider the particular case of the set of quantile functions, $\mathcal{E}([0, 1], \mathbb{R})$,
defined on [0, 1], we don't have a subspace of the vector space $(\mathcal{F}, +, \cdot)$. Analyzing the behavior of these
operations it is possible to understand why $\mathcal{E}([0, 1], \mathbb{R})$, with the usual operations, does not verify the
vector space definition.

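Consider the quantile functions $\Psi^{-1}_X(t)$ and $\Psi^{-1}_Y(t)$ defined according to (2) in Definition 2.1, both with $n$
subintervals, after having been rewritten in accordance with the process described in [14]. These functions
represent the distributions that the histogram-valued variables $X$ and $Y$ take for one unit. The addition of
these quantile functions leads to the function

\[
\Psi^{-1}_X(t) + \Psi^{-1}_Y(t) =
\begin{cases}
\underline{I}_{X_1} + \underline{I}_{Y_1} + \dfrac{t}{w_1}\left(a_{X_1} + a_{Y_1}\right) & \text{if } 0 \le t < w_1 \\[2mm]
\underline{I}_{X_2} + \underline{I}_{Y_2} + \dfrac{t - w_1}{w_2 - w_1}\left(a_{X_2} + a_{Y_2}\right) & \text{if } w_1 \le t < w_2 \\
\quad \vdots \\
\underline{I}_{X_n} + \underline{I}_{Y_n} + \dfrac{t - w_{n-1}}{1 - w_{n-1}}\left(a_{X_n} + a_{Y_n}\right) & \text{if } w_{n-1} \le t \le 1
\end{cases}
\]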

When we add two quantile functions we obtain a non-decreasing function. In this case both the slope and 
the y-intercept of the resulting function are influenced by the two functions. 

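The particular case of the addition of a quantile function $\Psi^{-1}_X(t)$ with a real number $a$ is the function

\[
\Psi^{-1}_X(t) + a =
\begin{cases}
\underline{I}_{X_1} + a + \dfrac{t}{w_1}\, a_{X_1} & \text{if } 0 \le t < w_1 \\[2mm]
\underline{I}_{X_2} + a + \dfrac{t - w_1}{w_2 - w_1}\, a_{X_2} & \text{if } w_1 \le t < w_2 \\
\quad \vdots \\
\underline{I}_{X_n} + a + \dfrac{t - w_{n-1}}{1 - w_{n-1}}\, a_{X_n} & \text{if } w_{n-1} \le t \le 1
\end{cases}
\]

In this case, only the y-intercept is affected by the operation: we have a translation up when adding a
positive real number $a$ and a translation down when the real number $a$ is negative.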

The multiplication of the quantile function $\Psi^{-1}_X(t)$ by a real number $\lambda$ leads to the function

\[
\lambda \Psi^{-1}_X(t) =
\begin{cases}
\lambda \underline{I}_{X_1} + \dfrac{t}{w_1}\left(\lambda a_{X_1}\right) & \text{if } 0 \le t < w_1 \\
\quad \vdots \\
\lambda \underline{I}_{X_n} + \dfrac{t - w_{n-1}}{1 - w_{n-1}}\left(\lambda a_{X_n}\right) & \text{if } w_{n-1} \le t \le 1
\end{cases}
\]

In this case, both the slope and the y-intercept are affected by $\lambda$. If $\lambda$ is positive we will have a
non-decreasing function, but if $\lambda$ is negative we will obtain a decreasing function that cannot be a quantile
function, because quantile functions must always be non-decreasing functions. It is for this reason that
$\mathcal{E}([0, 1], \mathbb{R})$ is a semi-vector space.

The following example illustrates this situation. 

Example 2.2 Consider the distribution represented by the quantile function $\Psi^{-1}_X(t)$ presented in
Example 2.1. If we multiply the quantile function $\Psi^{-1}_X(t)$ by the positive real number 2, we obtain a
non-decreasing function, but if we multiply the quantile function $\Psi^{-1}_X(t)$ by the negative real number $-1$ the
resulting function is not a non-decreasing function. The following functions and representations in Figure
3 illustrate this situation.

\[
2\Psi^{-1}_X(t) =
\begin{cases}
2 + \dfrac{t}{0.1} \times 4 & \text{if } 0 \le t < 0.1 \\[2mm]
6 + \dfrac{t - 0.1}{0.6} \times 4 & \text{if } 0.1 \le t < 0.7 \\[2mm]
10 + \dfrac{t - 0.7}{0.3} \times 6 & \text{if } 0.7 \le t \le 1
\end{cases}
\qquad
-\Psi^{-1}_X(t) =
\begin{cases}
-1 + \dfrac{t}{0.1} \times (-2) & \text{if } 0 \le t < 0.1 \\[2mm]
-3 + \dfrac{t - 0.1}{0.6} \times (-2) & \text{if } 0.1 \le t < 0.7 \\[2mm]
-5 + \dfrac{t - 0.7}{0.3} \times (-3) & \text{if } 0.7 \le t \le 1
\end{cases}
\]

Figure 3: Representation of the functions $\Psi^{-1}_X(t)$, $2\Psi^{-1}_X(t)$, $-\Psi^{-1}_X(t)$ in Example 2.2.

In conclusion, $\mathcal{E}([0, 1], \mathbb{R})$ is not a vector space because the elements of this space do not have symmetric
elements. If we have a quantile function $\Psi^{-1}_X(t)$, the function $-\Psi^{-1}_X(t)$ is not a non-decreasing function
and consequently cannot be a quantile function. However, if we consider the distributions represented
by histograms and use the histogram arithmetic proposed by Colombo [9], it is possible to obtain a new
histogram that is the symmetric of the histogram $H_X$. The histogram $-H_X$ is the symmetric of the
histogram $H_X$ if $-H_X$ and $H_X$ are symmetric with respect to the y-axis.

As an example of the situation above, Figure 4 represents the histogram $H_X$ in Example 2.1 and the
respective symmetric histogram.

Figure 4: Representation of the histogram $H_X$ in Example 2.1 and the respective symmetric histogram
$-H_X$.

It is obviously possible to define the quantile function that represents the distribution of the histogram
$-H_X$. This quantile function is $-\Psi^{-1}_X(1 - t)$ with $t \in [0, 1]$ and is not the function obtained by
multiplying the quantile function $\Psi^{-1}_X(t)$ by $-1$. Figure 5 shows that the function $-\Psi^{-1}_X(t)$ in Example 2.2 is
different from the quantile function $-\Psi^{-1}_X(1 - t)$ that corresponds to the histogram $-H_X$.

Figure 5: Representation of the functions $\Psi^{-1}_X(t)$, $-\Psi^{-1}_X(t)$ and $-\Psi^{-1}_X(1 - t)$ in Example 2.2.

To conclude this section, it is important to underline some conclusions about the function $-\Psi^{-1}_X(1 - t)$,
$t \in [0, 1]$:

• As it is required for quantile functions, $-\Psi^{-1}_X(1 - t)$ is a non-decreasing function;

• $\Psi^{-1}_X(t) - \Psi^{-1}_X(1 - t)$ is not a null function, as might be expected, but is a quantile function with null
(symbolic) mean [4];

• the functions $-\Psi^{-1}_X(1 - t)$ and $\Psi^{-1}_X(t)$ are linearly independent, provided that $-\Psi^{-1}_X(1 - t) \neq \Psi^{-1}_X(t)$;

• $-\Psi^{-1}_X(1 - t) = \Psi^{-1}_X(t)$ only when the histogram $H_X$ is symmetric with respect to the y-axis.
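The properties listed above can be checked numerically for the histogram $H_X$ of Example 2.1. The following is a small sketch; the helper names (`is_non_decreasing`, `mean`) and the grid-based checks are illustrative choices made here, not part of the paper.

```python
def psi_x(t):
    """Quantile function of H_X = {[1,3[, 0.1; [3,5[, 0.6; [5,8], 0.3}."""
    if t < 0.1:
        return 1 + t / 0.1 * 2
    if t < 0.7:
        return 3 + (t - 0.1) / 0.6 * 2
    return 5 + (t - 0.7) / 0.3 * 3


def minus_psi_x(t):
    """Pointwise product by -1: not a quantile function."""
    return -psi_x(t)


def symmetric_psi_x(t):
    """-Psi(1 - t): quantile function of the symmetric histogram -H_X."""
    return -psi_x(1 - t)


def is_non_decreasing(f, steps=1000, tol=1e-12):
    """Check monotonicity on a regular grid of [0, 1]."""
    values = [f(k / steps) for k in range(steps + 1)]
    return all(a <= b + tol for a, b in zip(values, values[1:]))


def mean(f, steps=10000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    return sum(f((k + 0.5) / steps) for k in range(steps)) / steps


def sum_with_symmetric(t):
    """Psi(t) - Psi(1 - t): sum of a quantile function and its symmetric."""
    return psi_x(t) + symmetric_psi_x(t)
```

On this example, `minus_psi_x` fails the monotonicity check while `symmetric_psi_x` and `sum_with_symmetric` pass it, and the mean of `sum_with_symmetric` is zero up to rounding even though the function itself is not identically zero.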

3 Linear Regression Model for histogram-valued variables 

The first linear regression model for histogram-valued variables was proposed by Billard and Diday [3].
This model is a generalization of the Center Model [ ] defined by the same authors for interval-valued
variables; however, with this model the predicted results may fail to be histogram values. Because
of this, some studies have recently emerged in an attempt to find new proposals for a linear regression
model for this kind of variable. A recent model has been proposed by Verde and Irpino [ ].

Our main goal in this work is to propose a linear regression model for histogram-valued variables; more
precisely, a linear regression model that considers data with variability and allows predicting
histogram values.

To define this model, three problems need to be solved: 

• Find an error measure to quantify the difference between the observed and predicted distributions 
represented by histograms or quantile functions; 

• Define a linear regression model for histogram-valued variables that allows predicting histograms 
or their quantile functions from other histograms or quantile functions, without forcing a direct 
linear relationship; 

• Measure the goodness-of-fit of the model. 

3.1 Error measure 

In classical linear regression, to quantify the error between the observed values $y_j$ and the predicted values
$\hat{y}_j$, the difference between two real numbers, $e_j = y_j - \hat{y}_j$, is used. In this case, the model to estimate
the values $\hat{y}_j$ minimizes the quantity $\sum_{j=1}^{m} (y_j - \hat{y}_j)^2$. However, due to the complexity of histogram-valued
variables, the error between the observed and predicted distributions requires a different approach.

In their work about forecasting time series, applied to histogram-valued variables, Arroyo and Mate [1],
[2] also needed to measure the error between the observed and forecasted distributions. Therefore, they
sought a good measure to analyze the similarity between two distributions. Firstly, they considered
the possibility of computing the difference between two distributions represented by their respective
histograms using the histogram arithmetic. However, this option turned out to be of little use. As we have
seen, it is not easy to operate with the histogram arithmetic and some results are not as expected. This
shows that it is not adequate to analyze the similarity between distributions with this concept. Those
authors therefore turned to dissimilarity measures for distributions and opted for the Wasserstein
and Mallows distances [15], [1] to measure the difference between the observed and forecasted distributions.
The justification for the choice of the Wasserstein and Mallows distances was the fact that they are
distances and thus present interesting properties for error measurement: positive definiteness, symmetry,
and the triangle inequality. On the other hand, for Arroyo and Mate [1], [2], the Mallows distance
is the one that best matches the concept of distance as assessed by the human eye. This distance was
also used in other works such as Irpino and Verde [ ], where the Mallows distance is used to determine
the barycentric histogram and is then successfully applied to cluster histogram data. The same authors
used this distance in their linear regression model for histogram-valued variables [23].

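In using the Wasserstein and Mallows distances, the distributions taken by the histogram-valued variables
are represented by their quantile functions. These distances are defined as follows:

Definition 3.1 Given two quantile functions $\Psi^{-1}_{X(j)}(t)$ and $\Psi^{-1}_{Y(j)}(t)$ that represent the distributions that
the histogram-valued variables $X$ and $Y$ take at unit $j$, the Wasserstein distance is defined as:

\[
D_W\left(\Psi^{-1}_{X(j)}(t), \Psi^{-1}_{Y(j)}(t)\right) = \int_0^1 \left| \Psi^{-1}_{X(j)}(t) - \Psi^{-1}_{Y(j)}(t) \right| dt
\tag{6}
\]

and the Mallows distance:

\[
D_M\left(\Psi^{-1}_{X(j)}(t), \Psi^{-1}_{Y(j)}(t)\right) = \sqrt{\int_0^1 \left( \Psi^{-1}_{X(j)}(t) - \Psi^{-1}_{Y(j)}(t) \right)^2 dt}
\tag{7}
\]

Instead of using the quantile functions that represent the distributions, Irpino and Verde [14] rewrote the
Mallows distance using the histograms, more specifically the centre and half-range of their subintervals.
The square of the Mallows distance can be also defined as follows:

Property 3.1 Consider two histogram-valued variables $X$ and $Y$. The distributions that these variables
take for a given unit $j$ can be represented by the quantile functions $\Psi^{-1}_{X(j)}(t)$ and $\Psi^{-1}_{Y(j)}(t)$ or the
histograms $H_{X(j)}$ and $H_{Y(j)}$. The square of the Mallows distance between these distributions is given by

\[
D_M^2\left(\Psi^{-1}_{X(j)}(t), \Psi^{-1}_{Y(j)}(t)\right) = \sum_{i=1}^{n} p_{ji} \left[ \left(c_{X(j)_i} - c_{Y(j)_i}\right)^2 + \frac{1}{3} \left(r_{X(j)_i} - r_{Y(j)_i}\right)^2 \right]
\]

where, relatively to the histogram-valued variables $X$ or $Y$ for unit $j$:

• $c_{X(j)_i} = \dfrac{\overline{I}_{X(j)_i} + \underline{I}_{X(j)_i}}{2}$ and $c_{Y(j)_i} = \dfrac{\overline{I}_{Y(j)_i} + \underline{I}_{Y(j)_i}}{2}$ are the centers of the intervals $i$, with
$i \in \{1, \ldots, n\}$;

• $r_{X(j)_i} = \dfrac{\overline{I}_{X(j)_i} - \underline{I}_{X(j)_i}}{2}$ and $r_{Y(j)_i} = \dfrac{\overline{I}_{Y(j)_i} - \underline{I}_{Y(j)_i}}{2}$ are the half-ranges of the intervals $i$, with
$i \in \{1, \ldots, n\}$.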
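The factor 1/3 in Property 3.1 can be checked directly from the centre and half-range form (4) of the quantile functions. The derivation below is a sketch under the assumption that both histograms have been rewritten with the same $n$ subintervals and the same weights $p_{ji} = w_{ji} - w_{j,i-1}$; the shorthand $\Delta c_i$, $\Delta r_i$ and the substitution variable $s$ are introduced here for convenience.

```latex
% Contribution of piece i, with s = (t - w_{j,i-1}) / p_{ji},
% \Delta c_i = c_{X(j)_i} - c_{Y(j)_i} and \Delta r_i = r_{X(j)_i} - r_{Y(j)_i}.
\begin{align*}
\int_{w_{j,i-1}}^{w_{ji}} \left( \Psi^{-1}_{X(j)}(t) - \Psi^{-1}_{Y(j)}(t) \right)^2 dt
  &= p_{ji} \int_0^1 \left( \Delta c_i + (2s - 1)\, \Delta r_i \right)^2 ds \\
  &= p_{ji} \left( \Delta c_i^2
     + 2\, \Delta c_i\, \Delta r_i \int_0^1 (2s - 1)\, ds
     + \Delta r_i^2 \int_0^1 (2s - 1)^2\, ds \right) \\
  &= p_{ji} \left( \Delta c_i^2 + \tfrac{1}{3}\, \Delta r_i^2 \right),
\end{align*}
% since \int_0^1 (2s - 1) ds = 0 and \int_0^1 (2s - 1)^2 ds = 1/3.
```

Summing these contributions over $i = 1, \ldots, n$ gives the expression in Property 3.1.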

It seems therefore appropriate to choose the Wasserstein or Mallows distance to measure the similarity 
between the observed and predicted distributions by the linear regression model. Because of the properties 
of the absolute value function we choose to define the error measure between two distributions with the 
Mallows distance. 
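As an illustration, the squared Mallows distance of Property 3.1 can be computed from centres and half-ranges and compared with a direct numerical integration of definition (7). The sketch below assumes both histograms are given as lists of `(lower, upper, weight)` triples that already share the same weights; the function names and the hard-coded rewriting of the histograms of Example 2.1 on the common cumulative weights 0.1, 0.7, 0.8, 1 are choices made for this example.

```python
def mallows_squared(hist_x, hist_y):
    """Squared Mallows distance via centres and half-ranges (Property 3.1).

    Both histograms must have the same number of subintervals and the
    same weights, piece by piece.
    """
    total = 0.0
    for (lx, ux, p), (ly, uy, q) in zip(hist_x, hist_y):
        if abs(p - q) > 1e-12:
            raise ValueError("histograms must share the same weights")
        dc = (lx + ux) / 2 - (ly + uy) / 2   # difference of centres
        dr = (ux - lx) / 2 - (uy - ly) / 2   # difference of half-ranges
        total += p * (dc * dc + dr * dr / 3)
    return total


def mallows_squared_numeric(hist_x, hist_y, steps=2000):
    """Midpoint-rule integration of the squared difference of quantiles."""
    total = 0.0
    for (lx, ux, p), (ly, uy, _) in zip(hist_x, hist_y):
        for k in range(steps):
            s = (k + 0.5) / steps            # relative position in the piece
            diff = (lx + s * (ux - lx)) - (ly + s * (uy - ly))
            total += p * diff * diff / steps
    return total


# H_X and H_Y of Example 2.1, rewritten on common weights (0.1, 0.6, 0.1, 0.2).
H_X = [(1, 3, 0.1), (3, 5, 0.6), (5, 6, 0.1), (6, 8, 0.2)]
H_Y = [(0, 0.125, 0.1), (0.125, 0.875, 0.6), (0.875, 1, 0.1), (1, 4, 0.2)]

closed_form = mallows_squared(H_X, H_Y)
numeric = mallows_squared_numeric(H_X, H_Y)
```

The two values agree up to the discretization error of the midpoint rule, which gives a quick sanity check of the closed form.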

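Definition 3.2 Consider, for each unit $j$, $\Psi^{-1}_{Y(j)}(t)$ the quantile function of the observed distribution $Y(j)$
and $\Psi^{-1}_{\hat{Y}(j)}(t)$ the quantile function that represents the predicted distribution $\hat{Y}(j)$. The error between
$Y(j)$ and $\hat{Y}(j)$ is defined by:

\[
SE(j) = D_M^2\left(\Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{\hat{Y}(j)}(t)\right)
\tag{8}
\]

The total error is the sum of the errors, which, according to Property 3.1, may be written as follows:

\[
SE = \sum_{j=1}^{m} D_M^2\left(\Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{\hat{Y}(j)}(t)\right)
   = \sum_{j=1}^{m} \sum_{i=1}^{n} p_{ji} \left[ \left(c_{Y(j)_i} - c_{\hat{Y}(j)_i}\right)^2 + \frac{1}{3} \left(r_{Y(j)_i} - r_{\hat{Y}(j)_i}\right)^2 \right]
\tag{9}
\]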

3.2 The DSD Regression Model 



The first option to define the functional linear relation between histogram data was to adapt the classical
model to these data. Consider that we want to predict the distributions that the histogram-valued variable
$Y$ takes from $p$ histogram-valued variables $X_k$ with $k \in \{1, \ldots, p\}$. At unit $j$, $j \in \{1, \ldots, m\}$, the
predicted distribution $\hat{Y}(j)$ would then be obtained as follows:

\[
\hat{Y}(j) = \gamma + \alpha_1 X_1(j) + \alpha_2 X_2(j) + \ldots + \alpha_p X_p(j).
\]

As already mentioned, in this work we choose to represent the distributions by quantile functions. However,
when we multiply a quantile function by a negative number we do not obtain a non-decreasing
function. Therefore, it is necessary to impose non-negativity restrictions on the parameters of the model.
Denoting by $\Psi^{-1}_{\hat{Y}(j)}(t)$ the quantile function of the predicted distribution $\hat{Y}(j)$, we obtain a linear regression
model in which $\Psi^{-1}_{\hat{Y}(j)}(t)$ is a constant term plus the combination $\sum_{k=1}^{p} \beta_k \Psi^{-1}_{X_k(j)}(t)$,
with $\beta_k \ge 0$ and $k \in \{1, 2, \ldots, p\}$.

The non-negativity constraints imposed on the coefficients force a direct linear relationship, and limitations
similar to those present in linear regression models defined for interval-valued variables occur (see,
e.g., [ ]). Although we did not generalize the model for interval-valued variables to histogram-valued
variables, in defining a model that allows a quantile function to be predicted from other quantile functions, we
obtain a model with the same limitations as observed before.

It is not possible to have negative parameters in the previous model. Nevertheless, it is fundamental to
allow for the possibility of a direct and an inverse linear relation between the variable $Y$ and the variables
$X_k$. For this reason, our proposal is to include in the linear regression model both the quantile functions
$\Psi^{-1}_{X_k(j)}(t)$ that represent the distributions that the histogram-valued variables $X_k$ take for each unit $j$,
and the quantile functions $-\Psi^{-1}_{X_k(j)}(1 - t)$ that represent the respective symmetric histograms $-H_{X_k(j)}$ (see
Section 2.2). Therefore we propose the following model:

Definition 3.3 Consider the histogram-valued variables $X_1; X_2; \ldots; X_p$. The quantile functions that
represent the distributions that these histogram-valued variables take for each unit $j$ are $\Psi^{-1}_{X_1(j)}(t)$,
$\Psi^{-1}_{X_2(j)}(t), \ldots, \Psi^{-1}_{X_p(j)}(t)$, and the quantile functions that represent the respective symmetric histograms
associated to each unit of the referred variables are $-\Psi^{-1}_{X_1(j)}(1 - t)$, $-\Psi^{-1}_{X_2(j)}(1 - t), \ldots, -\Psi^{-1}_{X_p(j)}(1 - t)$,
with $t \in [0, 1]$. Each quantile function $\Psi^{-1}_{Y(j)}(t)$ may be expressed as follows:

\[
\Psi^{-1}_{Y(j)}(t) = \Psi^{-1}_{\hat{Y}(j)}(t) + e_j(t),
\]

where $\Psi^{-1}_{\hat{Y}(j)}(t)$ is the predicted quantile function for unit $j$, obtained from

\[
\Psi^{-1}_{\hat{Y}(j)}(t) = \gamma + \sum_{k=1}^{p} \alpha_k \Psi^{-1}_{X_k(j)}(t) - \sum_{k=1}^{p} \beta_k \Psi^{-1}_{X_k(j)}(1 - t)
\]

with $t \in [0, 1]$; $\alpha_k, \beta_k \ge 0$, $k \in \{1, 2, \ldots, p\}$ and $\gamma \in \mathbb{R}$.

The error, for each unit $j$, is the piecewise function given by $e_j(t) = \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t)$.

For each unit $j$, the predicted distribution $\hat{Y}(j)$ can be represented by the quantile function $\Psi^{-1}_{\hat{Y}(j)}$ or by
the respective histogram $H_{\hat{Y}(j)}$. This linear regression model will be named Distribution and Symmetric
Distribution (DSD) Regression Model.

Consider the particular case of the linear regression model where there is only one explicative histogram-valued variable $X$. In this case we can obtain the quantile function $\Psi^{-1}_{\hat{Y}(j)}(t)$, for each unit $j$, by the model:

$$\Psi^{-1}_{\hat{Y}(j)}(t) = \gamma + \alpha\,\Psi^{-1}_{X(j)}(t) - \beta\,\Psi^{-1}_{X(j)}(1-t) \tag{10}$$

with $\alpha, \beta \geq 0$, and $\gamma \in \mathbb{R}$.

When the model includes both the distributions of the explicative histogram-valued variables and the respective symmetric distributions, the non-negativity restrictions on the parameters remain, but they no longer imply a direct linear relationship. In the particular case of (10), we consider that the linear regression is direct if $\alpha > \beta$ and inverse if $\alpha < \beta$.
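The following Python sketch illustrates model (10) by evaluating the predicted quantile function pointwise. It is only an illustration under stated assumptions: the histogram representation as `(lower, upper, weight)` triples, the helper names `quantile` and `dsd_predict`, and the toy histogram `x` are not taken from the paper; values are assumed to be uniformly distributed inside each subinterval.

```python
def quantile(bins, t):
    """Quantile function of a histogram given as [(lower, upper, weight), ...].

    Values are assumed uniform inside each subinterval, so the quantile
    function is piecewise linear. Weights are assumed positive and to sum to 1.
    """
    cum = 0.0
    for index, (lower, upper, weight) in enumerate(bins):
        is_last = index == len(bins) - 1
        if t <= cum + weight or is_last:
            frac = min(max((t - cum) / weight, 0.0), 1.0)
            return lower + frac * (upper - lower)
        cum += weight


def dsd_predict(bins, t, alpha, beta, gamma):
    """Model (10): gamma + alpha * Psi_X(t) - beta * Psi_X(1 - t)."""
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta must be non-negative")
    return gamma + alpha * quantile(bins, t) - beta * quantile(bins, 1.0 - t)


# Toy histogram (hypothetical data): [0, 1[ with weight 0.5, [1, 3] with weight 0.5.
x = [(0.0, 1.0, 0.5), (1.0, 3.0, 0.5)]
grid = [i / 10 for i in range(11)]

direct = [dsd_predict(x, t, 2.0, 1.0, -1.0) for t in grid]   # alpha > beta
inverse = [dsd_predict(x, t, 2.0, 8.0, 3.0) for t in grid]   # alpha < beta
```

In both cases the predicted function is non-decreasing in $t$, so it is still a quantile function even when $\alpha < \beta$. Shifting the toy histogram one unit to the right changes every predicted quantile by $\alpha - \beta$, which is negative in the second case; this is the sense in which the relationship is inverse.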



3.3 Parameters of the DSD Regression Model 

In classical statistics, the parameters of the linear regression model are estimated by solving the minimization problem $\sum_{j=1}^{m} (y_j - \hat{y}_j)^2$, where $y_j$ are the observed and $\hat{y}_j$ the predicted values, respectively, with $j \in \{1, \ldots, m\}$. To solve this problem the least squares method is used.

For histogram-valued variables, the parameters of the DSD Model in Definition 3.3 are estimated by solving a quadratic optimization problem, subject to non-negativity constraints on the unknowns.

Definition 3.4 Consider $\Psi^{-1}_{\hat{Y}(j)}(t)$ obtained by the DSD Model. The quadratic optimization problem is written as:

$$\text{Minimize } SE = \sum_{j=1}^{m} D_M^2\left(\Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{\hat{Y}(j)}(t)\right)$$

with $\alpha_k, \beta_k \geq 0$, $k \in \{1, 2, \ldots, p\}$ and $\gamma \in \mathbb{R}$.

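To present more specifically the function to minimize, it is important to define all the quantile functions involved in this expression considering the conditions referred to in Section 2.1. The quantile functions that represent the distributions taken by $X_k$ and the respective symmetric distributions, for a given unit $j$, are, respectively:

$$\Psi^{-1}_{X_k(j)}(t) = \begin{cases}
c_{X_k(j)_1} + \left(\frac{2t}{w_1} - 1\right) r_{X_k(j)_1} & \text{if } 0 \leq t < w_1 \\
c_{X_k(j)_2} + \left(\frac{2(t - w_1)}{w_2 - w_1} - 1\right) r_{X_k(j)_2} & \text{if } w_1 \leq t < w_2 \\
\quad\vdots \\
c_{X_k(j)_n} + \left(\frac{2(t - w_{n-1})}{1 - w_{n-1}} - 1\right) r_{X_k(j)_n} & \text{if } w_{n-1} \leq t \leq 1
\end{cases} \tag{11}$$

$$-\Psi^{-1}_{X_k(j)}(1-t) = \begin{cases}
-c_{X_k(j)_n} + \left(\frac{2t}{w_1} - 1\right) r_{X_k(j)_n} & \text{if } 0 \leq t < w_1 \\
-c_{X_k(j)_{n-1}} + \left(\frac{2(t - w_1)}{w_2 - w_1} - 1\right) r_{X_k(j)_{n-1}} & \text{if } w_1 \leq t < w_2 \\
\quad\vdots \\
-c_{X_k(j)_1} + \left(\frac{2(t - w_{n-1})}{1 - w_{n-1}} - 1\right) r_{X_k(j)_1} & \text{if } w_{n-1} \leq t \leq 1
\end{cases} \tag{12}$$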



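According to the DSD Model, the quantile function that represents the distribution taken by the predicted histogram-valued variable $\hat{Y}$, for a given unit $j$, is:

$$\Psi^{-1}_{\hat{Y}(j)}(t) = \begin{cases}
\sum_{k=1}^{p} \left(\alpha_k c_{X_k(j)_1} - \beta_k c_{X_k(j)_n}\right) + \gamma + \left(\frac{2t}{w_1} - 1\right) \sum_{k=1}^{p} \left(\alpha_k r_{X_k(j)_1} + \beta_k r_{X_k(j)_n}\right) & \text{if } 0 \leq t < w_1 \\
\sum_{k=1}^{p} \left(\alpha_k c_{X_k(j)_2} - \beta_k c_{X_k(j)_{n-1}}\right) + \gamma + \left(\frac{2(t - w_1)}{w_2 - w_1} - 1\right) \sum_{k=1}^{p} \left(\alpha_k r_{X_k(j)_2} + \beta_k r_{X_k(j)_{n-1}}\right) & \text{if } w_1 \leq t < w_2 \\
\quad\vdots \\
\sum_{k=1}^{p} \left(\alpha_k c_{X_k(j)_n} - \beta_k c_{X_k(j)_1}\right) + \gamma + \left(\frac{2(t - w_{n-1})}{1 - w_{n-1}} - 1\right) \sum_{k=1}^{p} \left(\alpha_k r_{X_k(j)_n} + \beta_k r_{X_k(j)_1}\right) & \text{if } w_{n-1} \leq t \leq 1
\end{cases} \tag{13}$$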



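Similarly, for unit $j$, the quantile function that represents the distribution taken by the histogram-valued variable $Y$ is

$$\Psi^{-1}_{Y(j)}(t) = \begin{cases}
c_{Y(j)_1} + \left(\frac{2t}{w_1} - 1\right) r_{Y(j)_1} & \text{if } 0 \leq t < w_1 \\
c_{Y(j)_2} + \left(\frac{2(t - w_1)}{w_2 - w_1} - 1\right) r_{Y(j)_2} & \text{if } w_1 \leq t < w_2 \\
\quad\vdots \\
c_{Y(j)_n} + \left(\frac{2(t - w_{n-1})}{1 - w_{n-1}} - 1\right) r_{Y(j)_n} & \text{if } w_{n-1} \leq t \leq 1
\end{cases} \tag{14}$$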



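Consider these quantile functions and the Mallows distance defined according to Property 3.1. The quadratic optimization problem presented in Definition 3.4 can then be rewritten as follows:

$$\text{Minimize } SE = \sum_{j=1}^{m} \sum_{i=1}^{n} p_i \left[ \left( c_{Y(j)_i} - \sum_{k=1}^{p} \left( \alpha_k c_{X_k(j)_i} - \beta_k c_{X_k(j)_{n-i+1}} \right) - \gamma \right)^2 + \frac{1}{3} \left( r_{Y(j)_i} - \sum_{k=1}^{p} \left( \alpha_k r_{X_k(j)_i} + \beta_k r_{X_k(j)_{n-i+1}} \right) \right)^2 \right] \tag{15}$$

subject to $\alpha_k, \beta_k \geq 0$, $k \in \{1, 2, \ldots, p\}$ and $\gamma \in \mathbb{R}$.

Or, in matricial form:

$$\text{Minimize } SE = \frac{1}{2} B^T H B + F^T B + C \tag{16}$$

subject to $-\alpha_k, -\beta_k \leq 0$; $k \in \{1, 2, \ldots, p\}$ and $\gamma \in \mathbb{R}$.

In this latter case, $H = [h_{lq}]$ is the Hessian matrix, a symmetric matrix of order $2p+1$, with $p$ the number of variables $X_k$. The elements of the symmetric matrix $H$ are defined as follows:

$$h_{lq} = \begin{cases}
\sum_{j=1}^{m} \sum_{i=1}^{n} p_i \left( 2\, c_{X_{\frac{l+1}{2}}(j)_i}\, c_{X_{\frac{q+1}{2}}(j)_i} + \frac{2}{3}\, r_{X_{\frac{l+1}{2}}(j)_i}\, r_{X_{\frac{q+1}{2}}(j)_i} \right) & \text{if } l, q \text{ are odd and } l, q \leq 2p \\
\sum_{j=1}^{m} \sum_{i=1}^{n} p_i \left( 2\, c_{X_{\frac{l}{2}}(j)_{n-i+1}}\, c_{X_{\frac{q}{2}}(j)_{n-i+1}} + \frac{2}{3}\, r_{X_{\frac{l}{2}}(j)_{n-i+1}}\, r_{X_{\frac{q}{2}}(j)_{n-i+1}} \right) & \text{if } l, q \text{ are even and } l, q \leq 2p \\
\sum_{j=1}^{m} \sum_{i=1}^{n} p_i \left( -2\, c_{X_{\frac{l}{2}}(j)_{n-i+1}}\, c_{X_{\frac{q+1}{2}}(j)_i} + \frac{2}{3}\, r_{X_{\frac{l}{2}}(j)_{n-i+1}}\, r_{X_{\frac{q+1}{2}}(j)_i} \right) & \text{if } l \text{ is even, } q \text{ is odd and } l, q \leq 2p \\
2 \sum_{j=1}^{m} \sum_{i=1}^{n} p_i\, c_{X_{\frac{q+1}{2}}(j)_i} & \text{if } q \text{ is odd and } l = 2p + 1 \\
-2 \sum_{j=1}^{m} \sum_{i=1}^{n} p_i\, c_{X_{\frac{q}{2}}(j)_{n-i+1}} & \text{if } q \text{ is even and } l = 2p + 1
\end{cases}$$

The remaining elements follow from the symmetry of $H$, and $h_{2p+1,\,2p+1} = 2m$ because the weights $p_i$ sum to one for each unit.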
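The column vector of independent terms, $F = [f_l]$, with $2p + 1$ rows, is given by:

$$f_l = \begin{cases}
-\sum_{j=1}^{m} \sum_{i=1}^{n} p_i \left( 2\, c_{Y(j)_i}\, c_{X_{\frac{l+1}{2}}(j)_i} + \frac{2}{3}\, r_{Y(j)_i}\, r_{X_{\frac{l+1}{2}}(j)_i} \right) & \text{if } l \text{ is odd and } l \leq 2p \\
\sum_{j=1}^{m} \sum_{i=1}^{n} p_i \left( 2\, c_{Y(j)_i}\, c_{X_{\frac{l}{2}}(j)_{n-i+1}} - \frac{2}{3}\, r_{Y(j)_i}\, r_{X_{\frac{l}{2}}(j)_{n-i+1}} \right) & \text{if } l \text{ is even and } l \leq 2p \\
-2 \sum_{j=1}^{m} \sum_{i=1}^{n} p_i\, c_{Y(j)_i} & \text{if } l = 2p + 1
\end{cases}$$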

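The elements of the matrices $H$ and $F$ are computed from the first order partial derivatives of the function $SE$ in (15). These derivatives are presented in Appendix A. Finally, the column vector of the parameters, and the real value $C$, are defined as follows:

$$B = \left[ \alpha_1 \;\; \beta_1 \;\; \alpha_2 \;\; \beta_2 \;\; \cdots \;\; \alpha_p \;\; \beta_p \;\; \gamma \right]^T$$

and

$$C = \sum_{j=1}^{m} \sum_{i=1}^{n} p_i \left( c_{Y(j)_i}^2 + \frac{1}{3}\, r_{Y(j)_i}^2 \right).$$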

For each particular situation, it is possible to solve this quadratic optimization problem, subject to non-negativity constraints, and find the optimal solution. Consider the optimal solution for this optimization problem,

$$B^* = \left[ \alpha_1^* \;\; \beta_1^* \;\; \alpha_2^* \;\; \beta_2^* \;\; \cdots \;\; \alpha_p^* \;\; \beta_p^* \;\; \gamma^* \right]^T.$$

Afterwards, it is possible to predict the distributions $\hat{Y}(j)$, for each $j \in \{1, \ldots, m\}$, considering the obtained vector $B^*$. Each predicted distribution may be represented by the quantile function as in (13) or by the respective histogram $H_{\hat{Y}(j)}$.


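As a rough illustration of how this constrained problem can be solved for a single explicative variable ($p = 1$), the sketch below writes $SE$ as a weighted least squares problem and enumerates the possible sets of parameters fixed at zero. This enumeration is not the solver used in the paper (none is specified here); it is a simple technique that is adequate only for very small $p$. The names `solve`, `design`, `fit_dsd` and `exact_response`, the representation of each histogram as a list of `(center, half_range)` pairs, the equiprobable weights, and the toy data are all assumptions made for the example.

```python
def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            return None  # singular subsystem
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = acc / aug[r][r]
    return sol


def design(xs, ys):
    """Rows and targets such that SE = sum((row . [alpha, beta, gamma] - target)^2)."""
    rows, targets = [], []
    for x, y in zip(xs, ys):
        n = len(x)
        w = 1.0 / n  # equiprobable subintervals (assumption)
        sc, sr = w ** 0.5, (w / 3.0) ** 0.5
        for i in range(n):
            (cx, rx), (cs, rs), (cy, ry) = x[i], x[n - 1 - i], y[i]
            rows.append([sc * cx, -sc * cs, sc])   # centers
            targets.append(sc * cy)
            rows.append([sr * rx, sr * rs, 0.0])   # half ranges
            targets.append(sr * ry)
    return rows, targets


def fit_dsd(xs, ys):
    """Return ([alpha, beta, gamma], SE) with alpha, beta >= 0."""
    rows, targets = design(xs, ys)
    best = None
    # gamma (index 2) is always free; alpha and beta may be fixed at zero.
    for free in ([0, 1, 2], [0, 2], [1, 2], [2]):
        gram = [[sum(r[u] * r[v] for r in rows) for v in free] for u in free]
        rhs = [sum(r[u] * t for r, t in zip(rows, targets)) for u in free]
        part = solve(gram, rhs)
        if part is None:
            continue
        b = [0.0, 0.0, 0.0]
        for idx, val in zip(free, part):
            b[idx] = val
        if b[0] < -1e-9 or b[1] < -1e-9:
            continue  # violates the non-negativity constraints
        b[0], b[1] = max(b[0], 0.0), max(b[1], 0.0)
        err = sum((sum(a * v for a, v in zip(row, b)) - t) ** 2
                  for row, t in zip(rows, targets))
        if best is None or err < best[1]:
            best = (b, err)
    return best


def exact_response(x, alpha, beta, gamma):
    """Histogram obtained from x by model (10) without error, piece by piece."""
    n = len(x)
    return [(alpha * x[i][0] - beta * x[n - 1 - i][0] + gamma,
             alpha * x[i][1] + beta * x[n - 1 - i][1]) for i in range(n)]


# Hypothetical data: three units, two equiprobable subintervals each.
xs = [[(1.0, 0.5), (2.5, 1.0)], [(0.0, 1.0), (1.5, 0.5)], [(4.0, 2.0), (6.5, 0.5)]]
ys = [exact_response(x, 2.0, 1.0, -1.0) for x in xs]
params, se = fit_dsd(xs, ys)
```

Because the response histograms are generated without error, `params` should be numerically close to the generating values (2, 1, -1) and `se` close to zero. For larger $p$ a dedicated bound-constrained quadratic programming or non-negative least squares routine would be the natural choice, since the enumeration grows as $4^p$.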

Consider the minimization problem defined in (15) or matricially in (16). The optimal solution of the quadratic optimization problem, subject to non-negativity constraints, verifies the Kuhn-Tucker conditions [27]. Therefore, the optimal solution $B^*$ for this optimization problem, for all $k \in \{1, \ldots, p\}$, verifies the following conditions:

$$\frac{\partial SE}{\partial \alpha_k}(B^*) \geq 0; \quad \frac{\partial SE}{\partial \beta_k}(B^*) \geq 0; \quad \frac{\partial SE}{\partial \gamma}(B^*) = 0; \quad \alpha_k^*\, \frac{\partial SE}{\partial \alpha_k}(B^*) = 0; \quad \beta_k^*\, \frac{\partial SE}{\partial \beta_k}(B^*) = 0.$$

From the Kuhn-Tucker conditions, it is possible to prove some properties associated with the predicted distribution. Some of these are the counterparts of the corresponding properties in classical statistics, and will allow defining a measure to evaluate the goodness-of-fit of the model. Before describing these properties, it is necessary to present two important definitions of the concept of mean for histogram-valued variables.

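Definition 3.5 [ ] Consider the histogram-valued variable $Y$. For each unit $j$, with $j \in \{1, \ldots, m\}$, $Y(j)$ may be represented by the histogram defined in (4). The mean of variable $Y$ is defined as follows:

$$\overline{Y} = \frac{1}{m} \sum_{j=1}^{m} \left( \sum_{i=1}^{n_j} c_{Y(j)_i}\, p_{ji} \right)$$

where $n_j$ is the number of subintervals for the $j$th unit, and $c_{Y(j)_i}$ and $p_{ji}$ are the center and the weight of subinterval $i$.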

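Irpino and Verde [ i ■ ] defined the barycentric histogram as the histogram that is at a minimum distance, in the sense of the Mallows distance, from the $m$ distributions. In this case, a mean distribution is obtained instead of a mean that is a real number.

The quantile function of the barycentric histogram is the same as the mean quantile function, which is computed from the average of the $m$ quantile functions that represent the $m$ given distributions. The mean quantile function is defined as follows:

Definition 3.6 Consider the $m$ quantile functions $\Psi^{-1}_{Y(j)}(t)$, $j \in \{1, \ldots, m\}$, all defined with $n$ pieces. The mean quantile function $\overline{\Psi^{-1}_{Y}}(t)$ is the function where each piece is the mean of the corresponding $m$ pieces involved. The function is then,

$$\overline{\Psi^{-1}_{Y}}(t) = \begin{cases}
\frac{1}{m} \sum_{j=1}^{m} c_{Y(j)_1} + \left(\frac{2t}{w_1} - 1\right) \frac{1}{m} \sum_{j=1}^{m} r_{Y(j)_1} & \text{if } 0 \leq t < w_1 \\
\frac{1}{m} \sum_{j=1}^{m} c_{Y(j)_2} + \left(\frac{2(t - w_1)}{w_2 - w_1} - 1\right) \frac{1}{m} \sum_{j=1}^{m} r_{Y(j)_2} & \text{if } w_1 \leq t < w_2 \\
\quad\vdots \\
\frac{1}{m} \sum_{j=1}^{m} c_{Y(j)_n} + \left(\frac{2(t - w_{n-1})}{1 - w_{n-1}} - 1\right) \frac{1}{m} \sum_{j=1}^{m} r_{Y(j)_n} & \text{if } w_{n-1} \leq t \leq 1
\end{cases}$$

So, we have $\overline{\Psi^{-1}_{Y}}(t) = \frac{1}{m} \sum_{j=1}^{m} \Psi^{-1}_{Y(j)}(t)$.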

These two concepts of mean for histogram-valued variables are related, as we can see in the following proposition.

Proposition 3.1 Considering the mean quantile function $\overline{\Psi^{-1}_{Y}}(t)$ of the histogram-valued variable $Y$ and its mean $\overline{Y}$, we have

$$\overline{Y} = \int_0^1 \overline{\Psi^{-1}_{Y}}(t)\, dt.$$

This result is due to Irpino and Verde [24] and may easily be proved considering Definitions 3.5 and 3.6.

Now, considering the previous results and the Kuhn-Tucker conditions, we may prove the following properties.

Property 3.2 For each unit $j$, let $\hat{Y}(j)$ be the distribution predicted by the DSD Model and consider the parameters obtained for the optimal solution $B^* = \left[ \alpha_1^* \;\; \beta_1^* \;\; \cdots \;\; \alpha_p^* \;\; \beta_p^* \;\; \gamma^* \right]^T$. The mean of the predicted histogram-valued variable $\hat{Y}$ is given by:

$$\overline{\hat{Y}} = \sum_{k=1}^{p} \left( \alpha_k^* - \beta_k^* \right) \overline{X}_k + \gamma^*.$$

Proof: Each observation $j$ of the predicted histogram-valued variable, $\hat{Y}(j)$, can be represented by the quantile function as in (13), considering for parameters the optimal solution $B^*$ of the quadratic optimization problem in (15). As such, the mean quantile function $\overline{\Psi^{-1}_{\hat{Y}}}$ can be calculated by Definition 3.6. So, applying Proposition 3.1, we can prove that $\overline{\hat{Y}} = \sum_{k=1}^{p} \left( \alpha_k^* - \beta_k^* \right) \overline{X}_k + \gamma^*$. □
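The step that the proof of Property 3.2 leaves implicit can be written out as follows. This is a sketch in the notation of this section (it is not copied from the paper): the change of variable $s = 1 - t$ shows that the symmetric term integrates to the same mean as the original quantile function, which is why $\beta_k^*$ enters the mean with a minus sign.

```latex
\begin{align*}
\int_0^1 \Psi^{-1}_{X_k(j)}(1-t)\,dt
  &= \int_0^1 \Psi^{-1}_{X_k(j)}(s)\,ds \qquad (s = 1 - t), \\
\overline{\hat{Y}}
  &= \int_0^1 \frac{1}{m}\sum_{j=1}^{m} \Psi^{-1}_{\hat{Y}(j)}(t)\,dt
   = \gamma^* + \sum_{k=1}^{p} \alpha_k^*\,\overline{X}_k
     - \sum_{k=1}^{p} \beta_k^*\,\overline{X}_k .
\end{align*}
```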

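Property 3.3 The mean of the predicted histogram-valued variable $\hat{Y}$ is equal to the mean of the observed histogram-valued variable $Y$.

Proof: Consider the function to minimize in (15),

$$SE = \sum_{j=1}^{m} \sum_{i=1}^{n} p_i \left[ \left( c_{Y(j)_i} - \sum_{k=1}^{p} \left( \alpha_k c_{X_k(j)_i} - \beta_k c_{X_k(j)_{n-i+1}} \right) - \gamma \right)^2 + \frac{1}{3} \left( r_{Y(j)_i} - \sum_{k=1}^{p} \left( \alpha_k r_{X_k(j)_i} + \beta_k r_{X_k(j)_{n-i+1}} \right) \right)^2 \right].$$

For the optimal solution $B^*$ we have $\frac{\partial SE}{\partial \gamma}(B^*) = 0$. Consequently,

$$2 \sum_{j=1}^{m} \sum_{i=1}^{n} p_i \left( \sum_{k=1}^{p} \alpha_k^* c_{X_k(j)_i} \right) - 2 \sum_{j=1}^{m} \sum_{i=1}^{n} p_i \left( \sum_{k=1}^{p} \beta_k^* c_{X_k(j)_{n-i+1}} \right) + 2m\gamma^* - 2 \sum_{j=1}^{m} \sum_{i=1}^{n} p_i\, c_{Y(j)_i} = 0$$

$$\Leftrightarrow \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} p_i \sum_{k=1}^{p} \alpha_k^* c_{X_k(j)_i} - \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} p_i \sum_{k=1}^{p} \beta_k^* c_{X_k(j)_{n-i+1}} + \gamma^* = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} p_i\, c_{Y(j)_i}.$$

From Property 3.2, it follows that $\overline{\hat{Y}} = \sum_{k=1}^{p} \left( \alpha_k^* - \beta_k^* \right) \overline{X}_k + \gamma^*$, so $\overline{\hat{Y}} = \overline{Y}$. □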



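Property 3.4 For each unit $j$, the quantile function for the distribution $\hat{Y}(j)$ predicted by the DSD Model can be rewritten as follows:

$$\Psi^{-1}_{\hat{Y}(j)}(t) - \overline{Y} = \sum_{k=1}^{p} \alpha_k^* \left( \Psi^{-1}_{X_k(j)}(t) - \overline{X}_k \right) - \sum_{k=1}^{p} \beta_k^* \left( \Psi^{-1}_{X_k(j)}(1-t) - \overline{X}_k \right).$$

Proof: In Property 3.3, we proved that

$$\overline{Y} = \sum_{k=1}^{p} \left( \alpha_k^* - \beta_k^* \right) \overline{X}_k + \gamma^* \;\Leftrightarrow\; \gamma^* = \overline{Y} - \sum_{k=1}^{p} \left( \alpha_k^* - \beta_k^* \right) \overline{X}_k.$$

For the optimal solution $B^*$, for each unit $j$, the quantile function predicted by the DSD linear regression model, in Definition 3.3, is given by

$$\Psi^{-1}_{\hat{Y}(j)}(t) = \gamma^* + \sum_{k=1}^{p} \alpha_k^*\, \Psi^{-1}_{X_k(j)}(t) - \sum_{k=1}^{p} \beta_k^*\, \Psi^{-1}_{X_k(j)}(1-t),$$

which may be rewritten as

$$\Psi^{-1}_{\hat{Y}(j)}(t) - \overline{Y} = \sum_{k=1}^{p} \alpha_k^* \left( \Psi^{-1}_{X_k(j)}(t) - \overline{X}_k \right) - \sum_{k=1}^{p} \beta_k^* \left( \Psi^{-1}_{X_k(j)}(1-t) - \overline{X}_k \right). \quad □$$

Property 3.5 For the observed and predicted distributions $Y(j)$ and $\hat{Y}(j)$, with $j \in \{1, \ldots, m\}$, of the variable $Y$, we have

$$\sum_{j=1}^{m} \int_0^1 \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right) \left( \Psi^{-1}_{\hat{Y}(j)}(t) - \overline{Y} \right) dt = 0.$$

Proof: The proof is given in Appendix B.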

3.4 Goodness-of-fit measure 

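To complete the investigation of the linear regression model for histogram-valued variables, a goodness-of-fit measure remains to be deduced. We define this measure in a similar way as in the classical model for real data.

Proposition 3.2 The sum of the squared Mallows distances between each observed distribution $j$, $j \in \{1, \ldots, m\}$, of the histogram-valued variable $Y$, and the mean of the histogram-valued variable $Y$, $\overline{Y}$, can be decomposed as follows:

$$\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right) = \sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{\hat{Y}(j)}(t) \right) + \sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{\hat{Y}(j)}(t), \overline{Y} \right).$$

Proof: Consider each observation $j$ of the histogram-valued variable $Y$, represented by its quantile function, and the mean of this histogram-valued variable, $\overline{Y}$. We have,

$$\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right) = \sum_{j=1}^{m} \int_0^1 \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right)^2 dt + \sum_{j=1}^{m} \int_0^1 \left( \Psi^{-1}_{\hat{Y}(j)}(t) - \overline{Y} \right)^2 dt + 2 \sum_{j=1}^{m} \int_0^1 \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right) \left( \Psi^{-1}_{\hat{Y}(j)}(t) - \overline{Y} \right) dt.$$

From Property 3.5 we have,

$$\sum_{j=1}^{m} \int_0^1 \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right) \left( \Psi^{-1}_{\hat{Y}(j)}(t) - \overline{Y} \right) dt = 0.$$

So, we can write

$$\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right) = \sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{\hat{Y}(j)}(t) \right) + \sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{\hat{Y}(j)}(t), \overline{Y} \right). \quad □$$

Therefore, similarly to the classical model, it is possible to define the goodness-of-fit measure of the DSD Model.

Definition 3.7 Consider the observed and predicted distributions of the histogram-valued variable, $Y$ and $\hat{Y}$, represented, respectively, by their quantile functions $\Psi^{-1}_{Y(j)}(t)$ and $\Psi^{-1}_{\hat{Y}(j)}(t)$, and the mean of the histogram-valued variable $Y$, $\overline{Y}$. The goodness-of-fit measure is given by

$$\Omega = \frac{\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{\hat{Y}(j)}(t), \overline{Y} \right)}{\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right)}.$$

In classical linear regression, the coefficient of determination $R^2$ ranges from 0 to 1. In this case, the goodness-of-fit measure, $\Omega$, also ranges from 0 to 1.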
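The goodness-of-fit measure can be computed directly from the centers and half ranges of the histograms. The sketch below is illustrative only: it assumes histograms stored as lists of `(center, half_range, weight)` triples, uses the fact that the squared Mallows distance between a piecewise linear quantile function and the constant $\overline{Y}$ is $\sum_i p_i \left[ (c_i - \overline{Y})^2 + r_i^2 / 3 \right]$, and the function names and toy data are hypothetical.

```python
def mallows_sq_to_constant(hist, value):
    """Squared Mallows distance between a histogram and the constant `value`.

    hist is [(center, half_range, weight), ...]. Integrating
    (Psi(t) - value)^2 over each piece gives p * ((c - value)^2 + r^2 / 3).
    """
    return sum(p * ((c - value) ** 2 + r * r / 3.0) for c, r, p in hist)


def symbolic_mean(hists):
    """Mean of a histogram-valued variable: average of the weighted centers."""
    return sum(sum(p * c for c, _, p in h) for h in hists) / len(hists)


def omega(observed, predicted):
    """Ratio of the sums of squared Mallows distances to the observed mean."""
    y_bar = symbolic_mean(observed)
    explained = sum(mallows_sq_to_constant(h, y_bar) for h in predicted)
    total = sum(mallows_sq_to_constant(h, y_bar) for h in observed)
    return explained / total


# Hypothetical observations: two units, two equiprobable subintervals each.
observed = [[(1.0, 0.5, 0.5), (2.5, 1.0, 0.5)],
            [(4.0, 1.0, 0.5), (5.5, 0.5, 0.5)]]
y_bar = symbolic_mean(observed)
constant = [[(y_bar, 0.0, 1.0)], [(y_bar, 0.0, 1.0)]]  # degenerate prediction

perfect_fit = omega(observed, observed)   # predicted equals observed
no_fit = omega(observed, constant)        # predicted is the constant mean
```

The two calls correspond to the extreme situations of the measure: coincident observed and predicted distributions, and a constant prediction equal to $\overline{Y}$. Note that the bound $\Omega \leq 1$ relies on the predictions coming from the optimal solution of the DSD Model; the function itself does not enforce it for arbitrary inputs.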

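Proposition 3.3 The goodness-of-fit measure $\Omega$ ranges from 0 to 1.

Proof: Consider the goodness-of-fit measure

$$\Omega = \frac{\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{\hat{Y}(j)}(t), \overline{Y} \right)}{\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right)}.$$

This measure is non-negative, because both sums are sums of squared distances. So, $\Omega \geq 0$.

From Proposition 3.2, we have

$$\Omega = \frac{\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right) - \sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{\hat{Y}(j)}(t) \right)}{\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right)} = 1 - \frac{\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{\hat{Y}(j)}(t) \right)}{\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right)}.$$

Since the term

$$\frac{\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{\hat{Y}(j)}(t) \right)}{\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right)}$$

is non-negative, the value of $\Omega$ is always less than or equal to 1. So, we have that $0 \leq \Omega \leq 1$.

Let us now analyze the extreme situations.

Suppose $\Omega = 0$. In this case,

$$\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{\hat{Y}(j)}(t), \overline{Y} \right) = 0 \;\Leftrightarrow\; \sum_{j=1}^{m} \int_0^1 \left( \Psi^{-1}_{\hat{Y}(j)}(t) - \overline{Y} \right)^2 dt = 0.$$

So, for all $j \in \{1, \ldots, m\}$, we have $\Psi^{-1}_{\hat{Y}(j)}(t) - \overline{Y} = 0 \Leftrightarrow \Psi^{-1}_{\hat{Y}(j)}(t) = \overline{Y}$. In this case, the predicted function for all observations $j$ is a constant function.

Suppose now that $\Omega = 1$. In this case,

$$\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{\hat{Y}(j)}(t), \overline{Y} \right) = \sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \overline{Y} \right).$$

From the decomposition obtained in Proposition 3.2 we have,

$$\sum_{j=1}^{m} D_M^2\left( \Psi^{-1}_{Y(j)}(t), \Psi^{-1}_{\hat{Y}(j)}(t) \right) = 0.$$

So, for all $j \in \{1, \ldots, m\}$, $\Psi^{-1}_{Y(j)}(t) = \Psi^{-1}_{\hat{Y}(j)}(t)$.

In this case, for each observation $j$, the predicted and observed quantile functions are coincident.

In conclusion, $0 \leq \Omega \leq 1$. If $\Omega = 0$ there is no linear relationship between the histogram-valued variable $Y$ and the histogram-valued variables $X_k$. If $\Omega = 1$, the linear relation is perfect, so the relationship between the histogram-valued variable $Y$ and the histogram-valued variables $X_k$, with $k \in \{1, \ldots, p\}$, is exactly the relation defined by the linear regression model. □

4 Experiments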



To illustrate and analyze the DSD Model we performed a simulation study and applied the method to real 
datasets. 

4.1 Simulation study 

To analyze the behavior of the parameter estimation and the performance of the DSD Model in different situations, we performed a simulation study. The first step was to generate the observations of the histogram-valued variables $X_k$, $k \in \{1, \ldots, p\}$, and $Y$, where $Y$ is the variable to be modeled from the $X_k$ by the linear relationship. Next, the parameters were estimated by the DSD Model and goodness-of-fit measures computed, considering symbolic simulated data tables covering different situations. From these results it was possible to analyze the behavior of the model and draw some meaningful conclusions.

4.1.1 Building symbolic simulated data tables 

The observations of the explicative and response histogram-valued variables Xk and Y were generated 
in different ways. 

• The observations of each histogram-valued variable $X_k$ are created.

According to the concept of symbolic variables, to obtain the $m$ observations associated to a histogram-valued variable $X_k$, we started by simulating 5000 real values corresponding to each unit. These values are then organized in histograms that represent the empirical distribution for each unit. It was considered, without loss of generality, that in all observations the subintervals of each histogram have the same weight (equiprobable), with frequency 0.10. This option is not restrictive, and is also supported by the work of Colombo [9]. If we had not considered equiprobable histograms with the same weights in all observations, we would have obtained a large number of different weights and consequently the subintervals would have very low frequencies. Histograms need not be equiprobable; however, the weight of each subinterval has to be the same in all observations (see Subsection 2.1). Furthermore, a diversity of weights would lead to rounding errors that make the histograms harder to work with.

• The observations of the histogram-valued variable $Y$ are created.

The histograms that are the observations of the histogram-valued variable $Y$ are obtained in three steps. First, we consider the perfect linear regression, without error, given by

$$\Psi^{-1}_{Y^*(j)}(t) = \gamma + \sum_{k=1}^{p} \alpha_k\, \Psi^{-1}_{X_k(j)}(t) - \sum_{k=1}^{p} \beta_k\, \Psi^{-1}_{X_k(j)}(1-t)$$

for particular values of the parameters. The histogram-valued variables $X_k$ and $Y^*$ are in a perfect linear relationship; this is, however, not what is intended when simulating a symbolic data table. Then, we disturb the perfect linear relationship by introducing an error function in the model, $\Psi^{-1}_{Y(j)}(t) = \Psi^{-1}_{Y^*(j)}(t) + e_j(t)$. This function is a piecewise linear function (but not necessarily a quantile function) defined by:


e.W = { " ^ "'^"^^ ^ (17) 

L «o)i + + Er=2^ 26o), + (2 ~ "'""^ < * < 1 

Each quantile function $\Psi^{-1}_{Y^*(j)}(t)$ is randomly disturbed by the error function for different values of $a_{(j)i}$ and $b_{(j)i}$, $i \in \{1, \ldots, n\}$. These values might have a high or low variation depending on whether we want the linear regression between the variables to be better or worse. The selection of these values takes into account the "magnitude" of the values considered in each distribution. The values of $b_{(j)i}$ cannot be lower than the minimum value of the half range, $-r_{Y^*(j)_i}$; otherwise, for this unit $j$ and subinterval $i$, the half range $r_{Y(j)_i}$ would be negative.
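The first step described above, organizing 5000 simulated values per unit into an equiprobable histogram with ten subintervals of frequency 0.10, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `equiprobable_histogram`, the use of linearly interpolated empirical quantiles as subinterval bounds, the fixed seed, and the choice of a Uniform microdata distribution are assumptions made for the example.

```python
import random


def equiprobable_histogram(values, n_bins=10):
    """Summarise microdata as n_bins subintervals of equal weight 1 / n_bins.

    Returns [(center, half_range), ...]. The bounds are empirical quantiles
    obtained by linear interpolation between order statistics.
    """
    data = sorted(values)
    last = len(data) - 1
    bounds = []
    for i in range(n_bins + 1):
        pos = last * i / n_bins
        lo = int(pos)
        hi = min(lo + 1, last)
        bounds.append(data[lo] + (pos - lo) * (data[hi] - data[lo]))
    return [((a + b) / 2.0, (b - a) / 2.0) for a, b in zip(bounds, bounds[1:])]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Unit-specific bounds of a Uniform microdata distribution (illustrative choice).
d1, d2 = rng.uniform(-2, 0), rng.uniform(0, 2)
micro = [rng.uniform(d1, d2) for _ in range(5000)]
hist = equiprobable_histogram(micro)
```

Each entry of `hist` is one subinterval described by its center and half range, which is the representation used in the quantile functions (11) to (14); consecutive subintervals share their bounds, so the resulting quantile function is continuous.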

To perform the simulation study, symbolic data tables that illustrate different situations were created. For each situation considered, 1000 data tables were generated. In this study a full factorial design was employed, with the following factors:

• Number of explicative histogram-valued variables: $p = 1$ and $p = 3$.

• Parameters of the DSD Model.

o For $p = 1$:

i) $\alpha = 2$; $\beta = 1$; $\gamma = -1$ ($\alpha$ and $\beta$ are close)

ii) $\alpha = 2$; $\beta = 8$; $\gamma = 3$ ($\alpha$ is lower than $\beta$)

iii) $\alpha = 8$; $\beta = 0$; $\gamma = 4$ ($\alpha$ is higher than $\beta$)

o For $p = 3$: $\alpha_1 = 2$; $\beta_1 = 1$; $\alpha_2 = 0.5$; $\beta_2 = 3$; $\alpha_3 = 4$; $\beta_3 = 2$; $\gamma = -1$

• Distribution of the microdata that allow generating the histograms corresponding to each observation of the variables $X_k$, $k \in \{1, \ldots, p\}$:

i) Uniform distribution

($X_k(j) \sim \mathcal{U}(\delta_1(j), \delta_2(j))$ where, for each $j \in \{1, \ldots, m\}$, $\delta_1(j) \sim \mathcal{U}(-2, 0)$ and $\delta_2(j) \sim \mathcal{U}(0, 2)$);

ii) Normal distribution

($X_k(j) \sim \mathcal{N}(\mu(j), \sigma^2(j))$ where, for each $j \in \{1, \ldots, m\}$, $\mu(j) \sim \mathcal{U}(0, 1)$ and $\sigma^2(j) \sim \mathcal{U}(0, 2)$);

iii) Log-Normal distribution

($X_k(j) \sim \ln\mathcal{N}(\mu(j), \sigma^2(j))$ where, for each $j \in \{1, \ldots, m\}$, $\mu(j) \sim \mathcal{U}(-0.5, 0.5)$ and $\sigma^2(j) \sim \mathcal{U}(0.5, 1)$);

iv) Mixture of distributions, randomly selected from {Uniform: $X_k(j) \sim \mathcal{U}(1, 3)$; Normal: $X_k(j) \sim \mathcal{N}(1, 1)$; Chi-square: $X_k(j) \sim \chi^2(1)$; Log-normal: $X_k(j) \sim \ln\mathcal{N}(0, 0.5)$; -Log-normal: $X_k(j) \sim -\ln\mathcal{N}(0, 0.5)$}

• Level of the linearity of the model: 

i) High linearity - In the error ej{t), the values of a(j)j and are randomly generated in 

Uci = U{—^ Cy ^LwdUri ~ ^(~| * ^'i'n{ry(j)i), \ * min{rY'{j)-)), respectively; 

ii) Moderate linearity - In the error ej{t), the values of a(j)-^ and are randomly generated in 

Uc2 — I * C*, I * (7) andUr2 ~ * "i*"-(''y(j)i)7 5 * TOm(ry.Q).)), respectively; 

iii) Low linearity - In the error ej{t), the values of a{j}-^ and bf^j-^. are randomly generated in 

Uc3 — U{—3*C,3*C) andUrs = Z^(—mm(ry*Q).), mm(ry.(j).)), respectively; 

• Sample size: m=10; 30; 100; 250. 

It is important to underline that in this simulation study it was only possible to control the type of distributions in the observations of the explicative histogram-valued variables. This simulation does not allow selecting the distributions in the observations of the response variable. These distributions depend on the distribution of the variables $Y^*$ (that in some situations are known, as we will see later) and the disturbance applied to the histograms $Y^*(j)$.

4.1.2 Description of the simulation study 

The simulated symbolic data tables include the observations of the histogram-valued variables $X_k$ and $Y$, according to the previous description and factors. For these tables, we computed the estimated parameters for the DSD Model and the goodness-of-fit measures. As we considered 1000 replications for each situation, the values presented are the means of the obtained values and the respective standard deviations (represented by $s$).

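The goodness-of-fit measures considered in this study are:

• $\Omega$, the measure deduced from the DSD Model (see Subsection 3.4);

• Root-mean-square error ($RMSE_M$), a measure defined using the Mallows distance (also used in the DSD Model), proposed by Irpino and Verde [ ' "]; it is defined by

$$RMSE_M = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \int_0^1 \left( \Psi^{-1}_{Y(j)}(t) - \Psi^{-1}_{\hat{Y}(j)}(t) \right)^2 dt};$$

• Adaptations of the lower bound ($RMSE_L$) and the upper bound ($RMSE_U$) root-mean-square errors that Neto and Carvalho [17], [18] use to study the performance of the linear regression models defined for interval-valued variables; for histogram-valued variables, the $RMSE_L$ and the $RMSE_U$ are given by:

$$RMSE_L = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} \left( \underline{I}_{Y(j)_i} - \underline{I}_{\hat{Y}(j)_i} \right)^2 p_i}, \qquad RMSE_U = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} \left( \overline{I}_{Y(j)_i} - \overline{I}_{\hat{Y}(j)_i} \right)^2 p_i}$$

with $[\underline{I}_{Y(j)_i}, \overline{I}_{Y(j)_i}[$ and $[\underline{I}_{\hat{Y}(j)_i}, \overline{I}_{\hat{Y}(j)_i}[$ the subintervals $i \in \{1, \ldots, n\}$ of the observed and predicted histograms, for each unit $j$.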
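The three root-mean-square measures can be computed together from the centers and half ranges of the observed and predicted histograms. The sketch below is a reading of the definitions in this subsection rather than the authors' implementation: it assumes that observed and predicted histograms share the same weights $p_i$, that the lower and upper bound errors are weighted by $p_i$, and that histograms are stored as `(center, half_range, weight)` triples; the function name and the toy data are hypothetical.

```python
from math import sqrt


def rmse_measures(observed, predicted):
    """Return (RMSE_M, RMSE_L, RMSE_U) for lists of histograms.

    Each histogram is [(center, half_range, weight), ...]; the observed and
    predicted histograms of a unit are assumed to share the same weights.
    """
    m = len(observed)
    sq_m = sq_l = sq_u = 0.0
    for obs, pred in zip(observed, predicted):
        for (cy, ry, p), (cp, rp, _) in zip(obs, pred):
            # squared Mallows distance, piece by piece
            sq_m += p * ((cy - cp) ** 2 + (ry - rp) ** 2 / 3.0)
            # lower bounds of the subintervals: center - half range
            sq_l += p * ((cy - ry) - (cp - rp)) ** 2
            # upper bounds of the subintervals: center + half range
            sq_u += p * ((cy + ry) - (cp + rp)) ** 2
    return sqrt(sq_m / m), sqrt(sq_l / m), sqrt(sq_u / m)


# Hypothetical example: the prediction shifts every subinterval by 0.5.
obs = [[(1.0, 1.0, 0.5), (3.0, 1.0, 0.5)]]
pred = [[(1.5, 1.0, 0.5), (3.5, 1.0, 0.5)]]
rmse_m, rmse_l, rmse_u = rmse_measures(obs, pred)
```

For a pure shift of the centers the three measures coincide with the size of the shift; they differ as soon as the half ranges are also mispredicted, because the Mallows-based measure weights half-range errors by one third.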



In Appendix C four tables are presented, each containing the results obtained with $p = 1$ for one of the distributions used to define the histogram values of $X$, i.e., all observations with Uniform distribution (Table 6), Normal distribution (Table 7), Log-Normal distribution (Table 8) and the observations of $X(j)$ for a mixture of distributions (Table 9). In the last two tables, similar results are presented for the cases where $p = 3$ (Table 10, Table 11).



4.1.3 Results and conclusions 

The main goals of this study are to analyze the behavior of the parameter estimation and the performance of the DSD Model. The results obtained for the model with one or three explicative variables are similar, so in this subsection we analyze in detail only the results obtained when $p = 1$. The results obtained when $p = 3$ may be found in Table 10 and Table 11 of Appendix C. For $p = 1$ it is also our goal to analyze how the symmetry/asymmetry of the distributions in the observations of the explicative histogram-valued variable affects the symmetry/asymmetry of the distributions in the observations of the response variable.

Concerning the analysis of the parameters' estimation.

For the simple case with one explicative histogram-valued variable $X$, we considered the mean of the values obtained for $\hat{\alpha}$, $\hat{\beta}$, $\hat{\gamma}$ and the mean square error (MSE) [ ]. In this case, as we replicated the same situation 1000 times, $MSE = \frac{1}{1000} \sum_{i=1}^{1000} (\theta - \hat{\theta}_i)^2$ (with $\theta$ corresponding to each parameter of the model). Comparing the first four tables in Appendix C, we can see that the behavior of the parameter estimation is independent of the distribution used to generate the microdata of the explicative variables. Furthermore, the estimated parameters are almost always close to the initial parameter values irrespective of the level of linearity. This behavior is expected when the level of linearity is high, but when the level of linearity is moderate or low it would not be surprising if the estimated parameters were more distant from the original ones, since it seems natural that other models exist that better fit the symbolic data. This is essentially observed in Tables 6 and 7, when the initial parameters $\alpha$ and $\beta$ are not close and when the number of observations is lower. Accordingly, the analysis of the behavior of the MSE and of the mean of the estimated parameters is essentially applicable in situations where the level of linearity is high. For these cases we observe that the values of the MSE decrease and tend to zero as the number of observations increases, and the mean of the estimated parameters becomes very close to the respective parameters of the model. These results confirm the empirical consistency of the estimators.
In the boxplots presented in Figures 6, 7 and 8 we may observe that, for the different types of distributions used to generate the histogram values of $X$, the boxes narrow around the true values of the respective parameters as the number $m$ of observations increases. The figures illustrate only the situation when $\alpha = 2$; $\beta = 1$; $\gamma = -1$, but the behavior for the other values is similar. It can also be observed that the range of variation of the estimated parameters relative to their original values is influenced by the distribution of the observations.

Figure 6: Boxplot for the estimated parameter $\alpha$ for a high level of linearity.

Figure 7: Boxplot for the estimated parameter $\beta$ for a high level of linearity.

Figure 8: Boxplot for the estimated parameter $\gamma$ for a high level of linearity.

Concerning the study of the goodness-of-fit measures.

The values obtained for $\Omega$ show that this measure provides a good evaluation of the level of linearity. The slightly disturbed models presented values of $\Omega$ close to one. On the other hand, when the error function applied to the model presented a high level of variability, the values of $\Omega$ are closer to zero. Furthermore, the means of the values of $\Omega$ are consistent with the respective values of the measures $RMSE_M$, $RMSE_L$ and $RMSE_U$. In general, as expected, the highest values of $\Omega$ correspond to the lowest values of $RMSE_M$. In all tables of Appendix C we can also verify that, in almost all situations, the values of the goodness-of-fit measures change in the same proportion as the levels of linearity. The mean values of the goodness-of-fit measures $RMSE_M$, $RMSE_L$ and $RMSE_U$ increase approximately four times when we pass from high to moderate linearity and approximately two times when we pass from moderate to low. This increase reflects exactly the range of variability tested in this study for the error function (four times from high to moderate linearity and two times from moderate to low).

Tables 2 and 3 illustrate the results obtained in an additional study for the original model with $\alpha = 2$; $\beta = 1$; $\gamma = -1$ and only for samples with 10 and 100 observations. Other situations were tested and the results were similar. The goal of the study was to analyze the sensitivity of the measure $\Omega$ to different kinds of error functions, which in some cases affect more the half ranges of the subintervals of the histograms and in others the centers. To analyze this behavior, the values of $\Omega$ were determined considering different error functions that use three levels of variability for the values of $a_{(j)i}$: $U_{c1}$, $U_{c2}$, $U_{c3}$, as defined in Subsection 4.1.1, and, for each one, three levels of variability for $b_{(j)i}$: $U_{r1}$, $U_{r2}$, $U_{r3}$, as defined in Subsection 4.1.1.

| $a_{(j)i}$ | $m$ | $\Omega_U$, $b_{(j)i} \sim U_{r1}$ | $\Omega_U$, $b_{(j)i} \sim U_{r2}$ | $\Omega_U$, $b_{(j)i} \sim U_{r3}$ | $\Omega_N$, $b_{(j)i} \sim U_{r1}$ | $\Omega_N$, $b_{(j)i} \sim U_{r2}$ | $\Omega_N$, $b_{(j)i} \sim U_{r3}$ |
|---|---|---|---|---|---|---|---|
| $U_{c1}$ | 10 | 0.9741 (0.0089) | 0.9455 (0.0216) | 0.8643 (0.0535) | 0.9792 (0.0079) | 0.9145 (0.0344) | 0.7587 (0.0835) |
| $U_{c1}$ | 100 | 0.9648 (0.0032) | 0.9322 (0.0076) | 0.8403 (0.0191) | 0.9762 (0.0025) | 0.8982 (0.0122) | 0.7160 (0.0293) |
| $U_{c2}$ | 10 | 0.7323 (0.0727) | 0.7163 (0.0786) | 0.6690 (0.0906) | 0.7980 (0.0583) | 0.7567 (0.0691) | 0.6555 (0.0997) |
| $U_{c2}$ | 100 | 0.6476 (0.0222) | 0.6332 (0.0238) | 0.5905 (0.0288) | 0.7701 (0.0165) | 0.7217 (0.0232) | 0.6008 (0.0326) |
| $U_{c3}$ | 10 | 0.4422 (0.1098) | 0.4320 (0.1090) | 0.4211 (0.1120) | 0.5192 (0.0931) | 0.5017 (0.0944) | 0.4587 (0.1013) |
| $U_{c3}$ | 100 | 0.3195 (0.0251) | 0.3156 (0.0265) | 0.3054 (0.0268) | 0.4627 (0.0243) | 0.4436 (0.0260) | 0.3970 (0.0306) |

Table 2: Mean values of $\Omega$ (standard deviations in parentheses) considering different levels of linearity, when the distributions generating observations of $X$ are Uniform ($\Omega_U$) and Normal ($\Omega_N$).

| $a_{(j)i}$ | $m$ | $\Omega_{LnN}$, $b_{(j)i} \sim U_{r1}$ | $\Omega_{LnN}$, $b_{(j)i} \sim U_{r2}$ | $\Omega_{LnN}$, $b_{(j)i} \sim U_{r3}$ | $\Omega_M$, $b_{(j)i} \sim U_{r1}$ | $\Omega_M$, $b_{(j)i} \sim U_{r2}$ | $\Omega_M$, $b_{(j)i} \sim U_{r3}$ |
|---|---|---|---|---|---|---|---|
| $U_{c1}$ | 10 | 0.9843 (0.0054) | 0.8848 (0.0389) | 0.6699 (0.0994) | 0.9780 (0.0078) | 0.9203 (0.0275) | 0.7789 (0.0721) |
| $U_{c1}$ | 100 | 0.9822 (0.0019) | 0.8786 (0.0146) | 0.6571 (0.0393) | 0.9719 (0.0026) | 0.9042 (0.0095) | 0.7403 (0.0220) |
| $U_{c2}$ | 10 | 0.8769 (0.0344) | 0.8032 (0.0587) | 0.6130 (0.1092) | 0.7765 (0.0569) | 0.7453 (0.0706) | 0.6568 (0.0954) |
| $U_{c2}$ | 100 | 0.8656 (0.0112) | 0.7765 (0.0229) | 0.5982 (0.0420) | 0.7225 (0.0182) | 0.6838 (0.0223) | 0.5884 (0.0287) |
| $U_{c3}$ | 10 | 0.6542 (0.0762) | 0.6114 (0.0923) | 0.5067 (0.1208) | 0.4884 (0.0948) | 0.4791 (0.0956) | Q.^AII (0.1024) |
| $U_{c3}$ | 100 | 0.6075 (0.0224) | 0.5654 (0.0293) | 0.4638 (0.0418) | 0.3979 (0.0228) | 0.3855 (0.0226) | 0.3526 (0.0260) |

Table 3: Mean values of $\Omega$ (standard deviations in parentheses) considering different levels of linearity, when the distributions generating observations of $X$ are Log-Normal ($\Omega_{LnN}$) and a mixture of distributions ($\Omega_M$).



Based on these results, we can say that, except when the observations of the explicative variables follow a Log-Normal distribution, the linearity between histogram-valued variables is more affected by disturbances in the centers of the subintervals than in the half ranges. This behavior is not surprising because the distance associated to this model is the Mallows distance and, as we observe in its definition, the contribution of the centers of the subintervals is three times that of the half ranges (see Definition 3.1 and Property 3.1). On the other hand, when all observations of the explicative histogram-valued variable have asymmetric distributions, the influence of the disturbance in the centers and half ranges may be similar. This different behavior may be related to the kind of distribution (symmetric/asymmetric) predicted for the observations of the histogram-valued variable $\hat{Y}(j)$, as we will see next.
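The three-to-one weighting of centers and half ranges can be checked numerically. The sketch below compares the closed form of the squared Mallows distance for histograms with common weights, $\sum_i p_i \left[ (c_{X_i} - c_{Y_i})^2 + \frac{1}{3} (r_{X_i} - r_{Y_i})^2 \right]$, with a midpoint-rule approximation of $\int_0^1 (\Psi^{-1}_X(t) - \Psi^{-1}_Y(t))^2 dt$. The function names, the `(center, half_range, weight)` representation and the toy histograms are assumptions made for the illustration.

```python
def mallows_sq(hx, hy):
    """Closed form for histograms [(center, half_range, weight), ...] with common weights."""
    return sum(p * ((cx - cy) ** 2 + (rx - ry) ** 2 / 3.0)
               for (cx, rx, p), (cy, ry, _) in zip(hx, hy))


def quantile(hist, t):
    """Piecewise linear quantile function built from centers and half ranges."""
    cum = 0.0
    for k, (c, r, p) in enumerate(hist):
        if t <= cum + p or k == len(hist) - 1:
            return c + (2.0 * (t - cum) / p - 1.0) * r
        cum += p


def mallows_sq_numeric(hx, hy, steps=2000):
    """Midpoint-rule approximation of the integral of (Psi_X - Psi_Y)^2 on [0, 1]."""
    h = 1.0 / steps
    return h * sum((quantile(hx, (i + 0.5) * h) - quantile(hy, (i + 0.5) * h)) ** 2
                   for i in range(steps))


# Same disturbance (0.3) applied to the center or to the half range of one subinterval.
base = [(2.0, 1.0, 1.0)]
center_moved = [(2.3, 1.0, 1.0)]
range_moved = [(2.0, 1.3, 1.0)]
d_center = mallows_sq(base, center_moved)
d_range = mallows_sq(base, range_moved)

# Closed form versus numerical integration on two-piece histograms.
hx = [(1.0, 0.5, 0.5), (2.5, 1.0, 0.5)]
hy = [(1.0, 1.0, 0.5), (3.0, 1.0, 0.5)]
closed, numeric = mallows_sq(hx, hy), mallows_sq_numeric(hx, hy)
```

Here `d_center` is three times `d_range`, which is the effect described above: a disturbance of a given size costs three times more in squared Mallows distance when it is applied to a center than when it is applied to a half range.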



Concerning symmetry/asymmetry of $\hat{Y}(j)$.

In this simulation study it was possible to analyze the symmetry/asymmetry of the predicted distributions obtained by the DSD Model, taking into consideration the symmetry/asymmetry of the distributions in the observations of the histogram-valued variable $X$ and the values of the parameters $\alpha$ and $\beta$. When the observation of the histogram-valued variable $X$ has a symmetric distribution, represented by $\Psi^{-1}_{X(j)}(t)$, the respective symmetric distribution $-\Psi^{-1}_{X(j)}(1-t)$ is also symmetric, but when the distribution $\Psi^{-1}_{X(j)}(t)$ is asymmetric positive (negative) (Log-Normal, for example), the respective symmetric distribution $-\Psi^{-1}_{X(j)}(1-t)$ is asymmetric negative (positive). In the DSD Model, the predicted distributions are obtained from $\Psi^{-1}_{\hat{Y}(j)}(t) = \gamma + \alpha\, \Psi^{-1}_{X(j)}(t) - \beta\, \Psi^{-1}_{X(j)}(1-t)$. Therefore, if the distribution $\Psi^{-1}_{X(j)}(t)$ is symmetric, the distribution $\Psi^{-1}_{\hat{Y}(j)}(t)$ also tends to be symmetric; if the distribution $\Psi^{-1}_{X(j)}(t)$ is asymmetric, the distribution $\Psi^{-1}_{\hat{Y}(j)}(t)$ tends to be symmetric when the values of $\alpha$ and $\beta$ are close, and asymmetric negative (resp. positive) when the value of $\alpha$ is lower (resp. higher) than the value of $\beta$. These conclusions are illustrated in Figure 9 considering all predicted distributions in the simulation study.

In conclusion, when the distributions of the observations $X(j)$ are symmetric, asymmetric positive or asymmetric negative, it is possible to forecast whether the distributions of $\hat{Y}(j)$ will be symmetric or asymmetric.

Figure 9: Boxplots of the difference between the mean and median of the estimated distributions in all situations with n = 10 considered in the simulation study, for $p = 1$.

4.2 Applied examples 

4.2.1 The relation between the hematocrit values and hemoglobin values 

This first example was presented in [^] to illustrate their linear regression model for histogram-valued variables. In this case, we have the symbolic data in Table 4, where 10 units are described by two symbolic variables, the hematocrit and the hemoglobin.

| Obs. | Hematocrit (Y) | Hemoglobin (X) |
|---|---|---|
| 1 | {[33.29; 37.52[, 0.6; [37.52; 39.61], 0.4} | {[11.54; 12.19[, 0.4; [12.19; 12.8], 0.6} |
| 2 | {[36.69; 39.11[, 0.3; [39.11; 45.12], 0.7} | {[12.07; 13.32[, 0.5; [13.32; 14.17], 0.5} |
| 3 | {[36.69; 42.64[, 0.5; [42.64; 48.68], 0.5} | {[12.38; 14.2[, 0.3; [14.2; 16.16], 0.7} |
| 4 | {[36.38; 40.87[, 0.4; [40.87; 47.41], 0.6} | {[12.38; 14.26[, 0.5; [14.26; 15.29], 0.5} |
| 5 | {[39.19; 50.86], 1} | {[13.58; 14.28[, 0.3; [14.28; 16.24], 0.7} |
| 6 | {[39.7; 44.32[, 0.4; [44.32; 47.24], 0.6} | {[13.81; 14.5[, 0.4; [14.5; 15.2], 0.6} |
| 7 | {[41.56; 46.65[, 0.6; [46.65; 48.81], 0.4} | {[14.34; 14.81[, 0.5; [14.81; 15.55], 0.5} |
| 8 | {[38.4; 42.93[, 0.7; [42.93; 45.22], 0.3} | {[13.27; 14.0[, 0.6; [14.0; 14.6], 0.4} |
| 9 | {[28.83; 35.55[, 0.5; [35.55; 41.98], 0.5} | {[9.92; 11.98[, 0.4; [11.98; 13.8], 0.6} |
| 10 | {[44.48; 52.53], 1} | {[15.37; 15.78[, 0.3; [15.78; 16.75], 0.7} |

We predicted the quantile function representing the distribution taken by the histogram-valued variable $Y$ from the DSD Model, and obtained:

$$\Psi^{-1}_{\hat{Y}(j)}(t) = -1.953 + 3.5598\, \Psi^{-1}_{X(j)}(t) - 0.4128\, \Psi^{-1}_{X(j)}(1-t)$$

The value of the goodness-of-fit measure is, for this case, $\Omega = 0.96$.

Figure 10: Observed and predicted quantile functions of each observation in Table 4.

Figure 10 compares the quantile functions of the observed and predicted distributions of the
histogram-valued variable Y. The distributions are very similar, in agreement with the value of the
coefficient of determination, Ω. The observed and predicted histograms of each observation are
presented in Appendix D.

Every predicted histogram value has an associated error function, defined according to
Definition 3.3. For this example, Figure 11 shows the error function for observations 1 and 3.



Figure 11: Error function for the observations 1 and 3. 

The relationship between the histogram-valued variables in Table 4 may be visualized in the scatter plot
for histograms in Figure 12. In this graphic, each of the distributions is represented by a histogram with a
different color. The plot shows a strong linear relation between the histogram-valued variables
hematocrit and hemoglobin.

Figure 12: Scatter plot of the data in Table 4.



From Property 3.2 we may conclude that, for the set of patients to which the data refer, the symbolic mean
of hematocrit increases by α - β = 3.1470 for each unit increase in the symbolic mean of hemoglobin.
As this value is positive, we may consider the relationship between the histogram-valued variables to be
direct.
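
The value 3.1470 can be checked directly from the fitted model. The short derivation below is a sketch only: it assumes that the mean of a histogram is the integral of its quantile function over [0, 1], and it uses v for the constant term and \mu_{X(j)} for the mean of the hemoglobin distribution of observation j; these symbols are introduced here for illustration and may differ from those of Property 3.2.

```latex
\int_0^1 \Psi^{-1}_{\hat{Y}(j)}(t)\,dt
  = v + \alpha \int_0^1 \Psi^{-1}_{X(j)}(t)\,dt - \beta \int_0^1 \Psi^{-1}_{X(j)}(1-t)\,dt
  = v + (\alpha - \beta)\,\mu_{X(j)},
\qquad
\alpha - \beta = 3.5598 - 0.4128 = 3.1470 .
```

The second equality uses the substitution u = 1 - t, which shows that \Psi^{-1}_{X(j)}(t) and \Psi^{-1}_{X(j)}(1-t) have the same integral over [0, 1].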

For this example, we also predicted the hematocrit distributions using the linear regression models
proposed by Billard and Diday [^] and Irpino and Verde [13], [23]. The hematocrit distributions obtained by
these methods are presented in Appendix D. To compare the performance of the methods, the measures
RMSE_M, RMSE_L and RMSE_U (see Subsection 4.1.3) were used (see Table 5).



Measure    DSD Model    Billard-Diday Model    Verde-Irpino Model

RMSE_L     0.8806       1.0288                 0.9220
RMSE_U     0.8432       1.1064                 0.8645
RMSE_M     0.8946       1.0507                 0.9145

Table 5: Comparison of the performance between the DSD Model, the Billard-Diday Model and the 
Verde-Irpino Model. 
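
The error measure underlying the model is the Mallows distance. As a rough sketch of how it can be evaluated for two histograms that share the same weights, the Python snippet below uses the closed form in terms of subinterval centres and half-ranges, assuming values are uniformly distributed within each subinterval. The names `mallows_sq` and `make_hist` are illustrative; the data are the observed and DSD-predicted histograms of observation 1 listed in Appendix D. The exact definitions of RMSE_M, RMSE_L and RMSE_U are those of Subsection 4.1.3 and are not reproduced here.

```python
import math


def mallows_sq(h1, h2):
    """Squared Mallows distance between two histograms [(lower, upper, weight), ...]
    with identical weights, assuming uniformity inside each subinterval."""
    total = 0.0
    for (l1, u1, p1), (l2, u2, p2) in zip(h1, h2):
        assert abs(p1 - p2) < 1e-12, "histograms must share the same weights"
        dc = (l1 + u1) / 2 - (l2 + u2) / 2  # difference of centres
        dr = (u1 - l1) / 2 - (u2 - l2) / 2  # difference of half-ranges
        total += p1 * (dc ** 2 + dr ** 2 / 3)
    return total


WEIGHTS = [0.3, 0.1, 0.1, 0.1, 0.1, 0.3]


def make_hist(cuts):
    """Build a histogram from consecutive cut points and the common weights."""
    return [(cuts[i], cuts[i + 1], w) for i, w in enumerate(WEIGHTS)]


# Observation 1: observed hematocrit and DSD prediction (Appendix D).
observed_1 = make_hist([33.29, 35.41, 36.11, 36.82, 37.52, 38.04, 39.61])
dsd_1 = make_hist([33.84, 35.70, 36.32, 36.73, 37.13, 37.56, 38.85])

distance_1 = math.sqrt(mallows_sq(observed_1, dsd_1))
print(f"Mallows distance for observation 1: {distance_1:.4f}")
```

Within one subinterval the two quantile functions differ by a linear function of the position inside the subinterval, which is why only the differences of centres and half-ranges appear in the sum.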

4.2.2 Distributions of Crimes in USA 

In this example we consider a real data table (microdata) [20] containing records on communities
in the USA. The original data combine socio-economic data from the 1990 Census and crime
data from 1995. For this study we selected the response variable violent crimes (total number of violent
crimes per 100 000 inhabitants) and four explicative variables: X1 (percentage of people aged 25 and over
with less than 9th grade education); X2 (percentage of people aged 16 and over who are employed);
X3 (percentage of the population who are divorced); X4 (percentage of immigrants who immigrated within the
last 10 years). To build the symbolic data table we aggregated the information (contemporary aggregation)
for each state. The units (higher units) of this study are the states of the USA, and their observations for
each selected variable are the distributions of the records of the communities of the respective state. To
build the initial data table we considered only the states for which the number of records for the selected
variables was higher than thirty. Using this criterion, only twenty states were included (AL, CA, CT, FL,
GA, IN, MA, MO, NC, NJ, NY, OH, OK, OR, PA, TN, TX, VA, WA, WI). As in the simulation
study, we consider, without loss of generality, that in all observations the subintervals of each histogram
have the same weight (equiprobable), with frequency 0.20. Furthermore, as the response variable violent
crimes admits only positive values and the distributions of these values are asymmetric, we consider
as response histogram-valued variable the variable LVC, whose observations are the distributions
of the logarithm of the number of violent crimes for each USA state. Under these conditions, the
model that predicts the distribution of LVC from the distributions of the explicative variables
X1, X2, X3 and X4, for each USA state j, is as follows:



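\Psi^{-1}_{\widehat{LVC}(j)}(t) = [\ldots] + 0.0009\,\Psi^{-1}_{X_1(j)}(t) - 0.0123\,\Psi^{-1}_{X_2(j)}(1 - t) +

        + 0.2073\,\Psi^{-1}_{X_3(j)}(t) - 0.0353\,\Psi^{-1}_{X_3(j)}(1 - t) + 0.0187\,\Psi^{-1}_{X_4(j)}(t)        (18)

with t ∈ [0, 1]. The goodness-of-fit measure associated with this model is Ω = 0.87.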

The estimated parameter values allow us to conclude that the variables X1, X3
and X4 have a direct influence on the logarithm of the number of violent crimes, whereas the percentage of
employed people has the opposite effect. From Property 3.2 we may conclude that, for the set of states
to which the data refer, when the symbolic mean of the percentage of divorced population increases by 1%
and the other variables remain constant, the symbolic mean of LVC increases by 0.1720. The percentage
of divorced population is the variable that most influences the predicted histogram-valued variable.
The same interpretation extends to the parameters associated with the other explicative
variables.
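
The figure 0.1720 and the direction of each effect can be reproduced from the coefficients of model (18). The Python sketch below is illustrative only: it assumes that a term missing from (18) has coefficient zero, and the names `coefficients` and `net_effect` are not part of the original work.

```python
# For each explicative variable: (coefficient of the quantile function at t,
# coefficient of the quantile function at 1 - t), as read from model (18).
# Terms that do not appear in (18) are assumed to have coefficient 0.
coefficients = {
    "X1": (0.0009, 0.0),
    "X2": (0.0, 0.0123),
    "X3": (0.2073, 0.0353),
    "X4": (0.0187, 0.0),
}

# Net change in the symbolic mean of LVC per unit change in the symbolic
# mean of each variable, the others being held constant.
net_effect = {name: a - b for name, (a, b) in coefficients.items()}

for name, value in net_effect.items():
    direction = "direct" if value > 0 else "inverse"
    print(f"{name}: {value:+.4f} ({direction})")

most_influential = max(net_effect, key=lambda name: abs(net_effect[name]))
print("Largest absolute effect:", most_influential)
```

The difference for X3 is 0.2073 - 0.0353 = 0.1720, and only X2 has a negative net effect, in line with the interpretation given in the text.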

The advantage of studying a linear relationship between data with variability is the possibility to predict 
the distribution of the values of the response variable instead of only one real value as in a classical study. 
In this example, the predicted distribution of the logarithm of the number of violent crimes for a given 
state is more informative about the criminality in that state than only one descriptive measure (e.g., the 
mean). 

Consider one state that was not used to build the model, the state of Arkansas (AR). It is possible to
predict the distribution of LVC if the distributions of the explicative variables for this state are known.
The histogram predicted by the DSD Model (18) for the state of Arkansas is

H_{LVC(AR)} = {[4.2250, 5.3158], 0.2; [5.3158, 5.8887], 0.2; [5.8887, 6.4802], 0.2;
               [6.4802, 7.0509], 0.2; [7.0509, 7.7913], 0.2}
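
To illustrate what the predicted histogram offers beyond a single number, the following Python sketch derives the median and the mean of the predicted LVC distribution for Arkansas. It assumes a uniform distribution within each subinterval; the names `lvc_ar`, `quantile`, `median` and `mean` are illustrative.

```python
from itertools import accumulate

# Histogram predicted by the DSD Model for Arkansas: (lower, upper, weight).
lvc_ar = [
    (4.2250, 5.3158, 0.2),
    (5.3158, 5.8887, 0.2),
    (5.8887, 6.4802, 0.2),
    (6.4802, 7.0509, 0.2),
    (7.0509, 7.7913, 0.2),
]

cuts = [lvc_ar[0][0]] + [upper for _, upper, _ in lvc_ar]        # interval bounds
cum = [0.0] + list(accumulate(weight for _, _, weight in lvc_ar))  # cumulative weights


def quantile(t):
    """Piecewise-linear quantile function through the points (cum[i], cuts[i])."""
    for i in range(len(lvc_ar)):
        if t <= cum[i + 1] + 1e-12:
            share = (t - cum[i]) / (cum[i + 1] - cum[i])
            return cuts[i] + share * (cuts[i + 1] - cuts[i])
    return cuts[-1]


median = quantile(0.5)
mean = sum(weight * (lower + upper) / 2 for lower, upper, weight in lvc_ar)
print(f"median = {median:.4f}, mean = {mean:.4f}")
```

Any other quantile of the predicted distribution can be read off in the same way, which is not possible when only a point prediction is available.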



Figure 13 illustrates the estimated and observed quantile functions for this state and the values of the
measures RMSE_M, RMSE_L and RMSE_U (see Subsection 4.1.3). The values of these measures
confirm the closeness between the observed and estimated quantile functions seen in the figure.




Figure 13: Observed and estimated quantile function of the variable LVC in the state of Arkansas 

Analyzing the predicted distribution, we may conclude that in the state of Arkansas the estimated
distribution is close to uniform, with the values of LVC ranging between 4.23 and 7.79.

The classical alternative for studying the logarithm of the number of violent crimes in each USA state would
be to reduce the records of all communities of each state to a single value, for example the mean, and carry
out a classical linear regression study. In this case, the variability of the records would be lost and the
predicted results would be less informative. Considering, for each state, the mean of the records of its
communities, the classical model is the following:



LVC(j) = 6.5817 + 0.0705\,X_1(j) - 0.0503\,X_2(j) + 0.0933\,X_3(j) + 0.0177\,X_4(j)        (19)



For this model the value of R^2 is 0.75.

Considering again the state of Arkansas, with the previous model (19) the estimate of LVC(AR) is
6.4511. With this approach the information about the behavior of the predicted variable is obviously
poorer.

5 Conclusion and perspectives



The DSD Model allows predicting the distributions taken by one histogram-valued variable from the
distributions taken by explicative histogram-valued variables. Moreover, it is possible to deduce a
goodness-of-fit measure from the model. This measure appears to behave well: when we compare the
predicted and observed quantile functions for each unit, the estimates are good when the value of the
goodness-of-fit measure is close to one, whereas the predicted and observed quantile functions are
more discrepant when its value is lower. As interval-valued variables are a particular case of
histogram-valued variables, it is possible to particularize this model for interval-valued variables.
An extension of the DSD Model, where a quantile function replaces the real number used as the
independent parameter, is under development. This approach will be applied both to histogram-valued
variables and to interval-valued variables, and we expect it to yield a more flexible model. Finally,
as a future research perspective, other models and methods in Symbolic Data Analysis based on linear
relationships between variables may now be developed using this approach.

Appendix D: Observed and predicted histograms of the experiments presented in
Subsection 4.2.1.



Histogram-valued variable Hematocrit in relation with the histogram-valued variable Hemoglobin 

In the example in Subsection 4.2.1 we performed a comparative study of the DSD Model with other
existing models. The results of applying the models proposed by Billard-Diday [3] and
Verde-Irpino [23] to the data of this example may be found in Table 12.



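DSD Model             \Psi^{-1}_{\hat{Y}(j)}(t) = -1.953 + 3.5598\,\Psi^{-1}_{X(j)}(t) - 0.4128\,\Psi^{-1}_{X(j)}(1 - t)

Billard-Diday Model   \underline{I}_{\hat{Y}(j)} = -2.16 + 3.16\,\underline{I}_{X(j)},   \overline{I}_{\hat{Y}(j)} = -2.16 + 3.16\,\overline{I}_{X(j)}

Verde-Irpino Model    \Psi^{-1}_{\hat{Y}(j)}(t) = [\ldots] + 3.161\,\overline{X}(j) + 3.918\,(\Psi^{-1}_{X(j)}(t) - \overline{X}(j))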



Table 12: Linear regression models applied to the data in Table 4. 



In Table 13, the white rows contain the observed histograms of each observation of the
histogram-valued variable Y, the light grey rows the histograms predicted using the DSD Model, the
grey rows the histograms predicted using the model proposed by Billard and Diday [5], and the
dark grey rows the histograms predicted using the model proposed by Verde and Irpino [23].

Obs.  Distributions of the values of hematocrit

1     Observed       {[33.29; 35.41[, 0.3; [35.41; 36.11[, 0.1; [36.11; 36.82[, 0.1; [36.82; 37.52[, 0.1; [37.52; 38.04[, 0.1; [38.04; 39.61], 0.3}
      DSD            {[33.84; 35.70[, 0.3; [35.70; 36.32[, 0.1; [36.32; 36.73[, 0.1; [36.73; 37.13[, 0.1; [37.13; 37.56[, 0.1; [37.56; 38.85], 0.3}
      Billard-Diday  {[34.33; 35.87[, 0.3; [35.87; 36.38[, 0.1; [36.38; 36.70[, 0.1; [36.70; 37.02[, 0.1; [37.02; 37.35[, 0.1; [37.35; 38.31], 0.3}
      Verde-Irpino   {[33.79; 35.7[, 0.3; [35.7; 36.34[, 0.1; [36.34; 36.73[, 0.1; [36.73; 37.13[, 0.1; [37.13; 37.53[, 0.1; [37.53; 38.73], 0.3}

2     Observed       {[36.69; 39.11[, 0.3; [39.11; 39.97[, 0.1; [39.97; 40.83[, 0.1; [40.83; 41.69[, 0.1; [41.69; 42.54[, 0.1; [42.54; 45.12], 0.3}
      DSD            {[35.16; 38.04[, 0.3; [38.04; 39.00[, 0.1; [39.00; 39.96[, 0.1; [39.96; 40.67[, 0.1; [40.67; 41.38[, 0.1; [41.38; 43.51], 0.3}
      Billard-Diday  {[36.00; 38.37[, 0.3; [38.37; 39.16[, 0.1; [39.16; 39.95[, 0.1; [39.95; 40.49[, 0.1; [40.49; 41.03[, 0.1; [41.03; 42.64], 0.3}

3     Observed       {[36.69; 40.26[, 0.3; [40.26; 41.45[, 0.1; [41.45; 42.64[, 0.1; [42.64; 43.85[, 0.1; [43.85; 45.06[, 0.1; [45.06; 48.68], 0.3}
      DSD            {[35.45; 42.27[, 0.3; [42.27; 43.38[, 0.1; [43.38; 44.50[, 0.1; [44.50; 45.61[, 0.1; [45.61; 46.72[, 0.1; [46.72; 50.46], 0.3}
      Billard-Diday  {[36.98; 42.74[, 0.3; [42.74; 43.62[, 0.1; [43.62; 44.51[, 0.1; [44.51; 45.39[, 0.1; [45.39; 46.28[, 0.1; [46.28; 48.93], 0.3}
      Verde-Irpino   {[35.29; 42.42[, 0.3; [42.42; 43.51[, 0.1; [43.51; 44.61[, 0.1; [44.61; 45.71[, 0.1; [45.71; 46.80[, 0.1; [46.80; 50.1], 0.3}

4     Observed       {[36.38; 39.75[, 0.3; [39.75; 40.87[, 0.1; [40.87; 41.96[, 0.1; [41.96; 43.05[, 0.1; [43.05; 44.14[, 0.1; [44.14; 47.41], 0.3}
      DSD            {[35.80; 40.08[, 0.3; [40.08; 41.50[, 0.1; [41.50; 42.92[, 0.1; [42.92; 43.81[, 0.1; [43.81; 44.70[, 0.1; [44.70; 47.37], 0.3}
      Billard-Diday  {[36.98; 40.55[, 0.3; [40.55; 41.74[, 0.1; [41.74; 42.93[, 0.1; [42.93; 43.58[, 0.1; [43.58; 44.23[, 0.1; [44.23; 46.18], 0.3}
      Verde-Irpino   {[35.71; 40.13[, 0.3; [40.13; 41.61[, 0.1; [41.61; 43.08[, 0.1; [43.08; 43.89[, 0.1; [43.89; 44.69[, 0.1; [44.69; 47.12], 0.3}

5     Observed       {[39.19; 42.69[, 0.3; [42.69; 43.86[, 0.1; [43.86; 45.03[, 0.1; [45.03; 46.19[, 0.1; [46.19; 47.36[, 0.1; [47.36; 50.86], 0.3}
      DSD            {[39.68; 42.52[, 0.3; [42.52; 43.64[, 0.1; [43.64; 44.75[, 0.1; [44.75; 45.86[, 0.1; [45.86; 46.97[, 0.1; [46.97; 50.25], 0.3}
      Billard-Diday  {[40.78; 42.99[, 0.3; [42.99; 43.87[, 0.1; [43.87; 44.76[, 0.1; [44.76; 45.64[, 0.1; [45.64; 46.53[, 0.1; [46.53; 49.19], 0.3}
      Verde-Irpino   {[39.8; 42.54[, 0.3; [42.54; 43.64[, 0.1; [43.64; 44.74[, 0.1; [44.74; 45.83[, 0.1; [45.83; 46.93[, 0.1; [46.93; 50.22], 0.3}

6     Observed       {[39.70; 43.17[, 0.3; [43.17; 44.32[, 0.1; [44.32; 44.81[, 0.1; [44.81; 45.29[, 0.1; [45.29; 45.78[, 0.1; [45.78; 47.24], 0.3}
      DSD            {[40.93; 42.92[, 0.3; [42.92; 43.58[, 0.1; [43.58; 44.04[, 0.1; [44.04; 44.51[, 0.1; [44.51; 44.99[, 0.1; [44.99; 46.45], 0.3}
      Billard-Diday  {[41.50; 43.14[, 0.3; [43.14; 43.68[, 0.1; [43.68; 44.05[, 0.1; [44.05; 44.42[, 0.1; [44.42; 44.79[, 0.1; [44.79; 45.90], 0.3}
      Verde-Irpino   {[40.92; 42.95[, 0.3; [42.95; 43.62[, 0.1; [43.62; 44.08[, 0.1; [44.08; 44.54[, 0.1; [44.54; 44.99[, 0.1; [44.99; 46.47], 0.3}

7     Observed       {[41.56; 44.11[, 0.3; [44.11; 44.95[, 0.1; [44.95; 45.80[, 0.1; [45.80; 46.65[, 0.1; [46.65; 47.19[, 0.1; [47.19; 48.81], 0.3}
      DSD            {[42.67; 43.86[, 0.3; [43.86; 44.26[, 0.1; [44.26; 44.65[, 0.1; [44.65; 45.22[, 0.1; [45.22; 45.78[, 0.1; [45.78; 47.48], 0.3}
      Billard-Diday  {[43.18; 44.07[, 0.3; [44.07; 44.37[, 0.1; [44.37; 44.66[, 0.1; [44.66; 45.13[, 0.1; [45.13; 45.60[, 0.1; [45.60; 47.00], 0.3}

8     Observed       {[38.4; 40.34[, 0.3; [40.34; 40.99[, 0.1; [40.99; 41.64[, 0.1; [41.64; 42.28[, 0.1; [42.28; 42.93[, 0.1; [42.93; 45.22], 0.3}
      DSD            {[39.26; 40.74[, 0.3; [40.74; 41.24[, 0.1; [41.24; 41.72[, 0.1; [41.72; 42.20[, 0.1; [42.20; 42.79[, 0.1; [42.79; 44.54], 0.3}
      Billard-Diday  {[39.80; 40.95[, 0.3; [40.95; 41.33[, 0.1; [41.33; 41.72[, 0.1; [41.72; 42.10[, 0.1; [42.10; 42.58[, 0.1; [42.58; 44.00], 0.3}

9     Observed       {[28.83; 32.86[, 0.3; [32.86; 34.21[, 0.1; [34.21; 35.55[, 0.1; [35.55; 36.84[, 0.1; [36.84; 38.12[, 0.1; [38.12; 41.98], 0.3}
      DSD            {[27.66; 33.54[, 0.3; [33.54; 35.50[, 0.1; [35.50; 36.70[, 0.1; [36.70; 37.91[, 0.1; [37.91; 39.20[, 0.1; [39.20; 43.08], 0.3}
      Billard-Diday  {[29.20; 34.09[, 0.3; [34.09; 35.72[, 0.1; [35.72; 36.68[, 0.1; [36.68; 37.63[, 0.1; [37.63; 38.59[, 0.1; [38.59; 41.47], 0.3}

10    Observed       {[44.48; 46.90[, 0.3; [46.90; 47.70[, 0.1; [47.70; 48.51[, 0.1; [48.51; 49.31[, 0.1; [49.31; 50.12[, 0.1; [50.12; 52.53], 0.3}
      DSD            {[45.85; 47.48[, 0.3; [47.48; 48.03[, 0.1; [48.03; 48.58[, 0.1; [48.58; 49.13[, 0.1; [49.13; 49.68[, 0.1; [49.68; 51.33], 0.3}
      Billard-Diday  {[46.43; 47.73[, 0.3; [47.73; 48.17[, 0.1; [48.17; 48.61[, 0.1; [48.61; 49.05[, 0.1; [49.05; 49.48[, 0.1; [49.48; 50.80], 0.3}
      Verde-Irpino   {[45.91; 47.51[, 0.3; [47.51; 48.06[, 0.1; [48.06; 48.6[, 0.1; [48.6; 49.14[, 0.1; [49.14; 49.68[, 0.1; [49.68; 51.31], 0.3}


Table 13: Observed and predicted histograms (using three different methods) of the Hematocrit values 
for the data in Table 4. 
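
The observed histograms in Table 13 are the histograms of Table 4 rewritten on a common set of weights (0.3, 0.1, 0.1, 0.1, 0.1, 0.3). A minimal Python sketch of this rewriting is given below; it assumes uniformity within each subinterval and cuts the quantile function at the cumulative weights. The function names `quantile` and `rewrite` are illustrative and not part of the original work.

```python
def quantile(hist, t):
    """Quantile function of a histogram [(lower, upper, weight), ...],
    assuming a uniform distribution inside each subinterval."""
    cum = 0.0
    for lower, upper, weight in hist:
        if t <= cum + weight + 1e-12:
            return lower + (t - cum) / weight * (upper - lower)
        cum += weight
    return hist[-1][1]


def rewrite(hist, new_weights):
    """Re-express a histogram on new weights by cutting its quantile
    function at the cumulative values of new_weights."""
    cuts, cum = [quantile(hist, 0.0)], 0.0
    for weight in new_weights:
        cum += weight
        cuts.append(quantile(hist, min(cum, 1.0)))
    return [(cuts[i], cuts[i + 1], w) for i, w in enumerate(new_weights)]


# Observed hematocrit of observation 1, as given in Table 4.
hematocrit_1 = [(33.29, 37.52, 0.6), (37.52, 39.61, 0.4)]
common_weights = [0.3, 0.1, 0.1, 0.1, 0.1, 0.3]

for lower, upper, weight in rewrite(hematocrit_1, common_weights):
    print(f"[{lower:.2f}; {upper:.2f}[, {weight}")
```

Up to rounding in the last digit, the resulting bounds agree with the first row of Table 13. Sharing the same weights across the observed and predicted histograms is what makes a subinterval-by-subinterval comparison possible.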